- Python 3.8+
- PyTorch >= 1.10.0 (CUDA recommended)
- pandas
- numpy
- tqdm
- nltk
- scikit-learn
- matplotlib
Install all dependencies:
```bash
pip install torch pandas numpy tqdm nltk scikit-learn matplotlib
```

Convert raw DARPA TC3 audit logs into the structured format (`srcUUID dstUUID action target timestamp`):
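The conversion script itself isn't listed here; a minimal sketch of the transformation, assuming hypothetical input field names (not the project's actual parser), could look like:

```python
# Hypothetical sketch: emit one raw audit record in the
# "srcUUID dstUUID action target timestamp" tab-separated format.
# Field names and values below are illustrative assumptions.
def to_structured(record: dict) -> str:
    fields = [
        record["src_uuid"],
        record["dst_uuid"],
        record["action"],
        record["target"],
        str(record["timestamp"]),
    ]
    return "\t".join(fields)

line = to_structured({
    "src_uuid": "AAAA-1111",
    "dst_uuid": "BBBB-2222",
    "action": "EVENT_READ",
    "target": "/etc/passwd",
    "timestamp": 1523630400,
})
```

Each output line then matches the five-column layout expected by the later processing steps.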
Combine GraphChi sketch vectors with the processed logs:
```bash
cd graph_data_process
python map_sketch_raw_data.py
```

Extract unique objects/actions and build the semantic token tree:
```bash
cd generate_token_tree
python nlp_stem.py > object_data/cadet_object.txt
python tree_from_path.py > object_data/cadet_tree.txt
```

Tokenize the log data and resample for class balance:
```bash
cd tokenize_train_test
python token_and_resample.py --semi_data_path='../data/cadet/full_data/graph/graph_data_sampled/'
```

For semi-supervised learning with pseudo-labels:
```bash
python semi_token_and_resample.py --semi_data_path='../data/cadet/full_data/graph/graph_data_sampled/'
```

Configure the dataset and model in `train/config.py`:
- `dataset_dir` — path to the dataset
- `train_corpus_file_paths` / `test_corpus_file_paths` — JSON filenames
- `vocab_size` — set to match the token tree size
- `model_name` — name for saved checkpoints
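As an illustration, a `train/config.py` with these entries might look like the following; all values are placeholders, not the repository's defaults:

```python
# Illustrative config values only -- adjust paths and sizes to your setup.
dataset_dir = "../data/cadet/full_data/graph/graph_data_sampled/"
train_corpus_file_paths = ["train_corpus.json"]
test_corpus_file_paths = ["test_corpus.json"]
vocab_size = 5000               # must match the generated token tree size
model_name = "cadet_transformer"  # checkpoint files are saved under this name
```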
Then train:
```bash
cd train
python train.py
```

Run threshold-based evaluation and generate metrics:
```bash
python test_log.py   # Threshold sweep, outputs results CSV
python test_roc.py   # ROC and precision-recall curves
```

The classifier follows an encoder-only transformer design:
- Token Embedding — maps tokenized log events to dense vectors
- Positional Encoding — injects sequence position information
- Transformer Encoder — multi-head self-attention over the log sequence
- Pooling — sum/average over sequence length
- Graph Embedding Fusion — optional concatenation of node2vec embeddings
- Classifier Head — binary output (benign vs. attack)
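The components above can be sketched in PyTorch. This is an illustrative reconstruction, not the repository's actual model; layer sizes, average pooling, and the concatenation-based fusion are assumptions:

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Standard sinusoidal positional encoding."""
    def __init__(self, d_model: int, max_len: int = 512):
        super().__init__()
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, x):  # x: (batch, seq, d_model)
        return x + self.pe[: x.size(1)]

class LogClassifier(nn.Module):
    def __init__(self, vocab_size=5000, d_model=64, nhead=4,
                 num_layers=2, graph_dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)   # token embedding
        self.pos = PositionalEncoding(d_model)           # positional encoding
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model + graph_dim, 2)    # benign vs. attack

    def forward(self, tokens, graph_emb):
        h = self.encoder(self.pos(self.embed(tokens)))   # self-attention over sequence
        pooled = h.mean(dim=1)                           # average pooling over length
        fused = torch.cat([pooled, graph_emb], dim=-1)   # graph embedding fusion
        return self.head(fused)

model = LogClassifier()
logits = model(torch.randint(0, 5000, (2, 16)), torch.randn(2, 32))  # (2, 2) logits
```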
Raw log format (tab-separated):
```
srcUUID	dstUUID	action	target	timestamp
```
Training JSON ([x, y, z]):
- `x` — list of token sequences (list of int lists)
- `y` — binary labels (`0` = benign, `1` = attack)
- `z` — node2vec graph embeddings per sequence
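A tiny illustrative example of the `[x, y, z]` layout (all values are made up, and the embedding dimension here is arbitrary):

```python
import json

# Three parallel lists: token sequences, binary labels, graph embeddings.
x = [[12, 7, 93, 4], [5, 5, 81]]          # tokenized log event sequences
y = [0, 1]                                 # 0 = benign, 1 = attack
z = [[0.1, -0.4, 0.7], [0.0, 0.2, -0.3]]   # node2vec embedding per sequence

payload = json.dumps([x, y, z])            # serialize in [x, y, z] order
x2, y2, z2 = json.loads(payload)           # round-trips back to the three lists
```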