Froot-NetSys/TIDAL

Tidal: Tackling Concept Drift in Provenance-based Advanced Persistent Threats Detection

Requirements

  • Python 3.8+
  • PyTorch >= 1.10.0 (CUDA recommended)
  • pandas
  • numpy
  • tqdm
  • nltk
  • scikit-learn
  • matplotlib

Install all dependencies:

pip install torch pandas numpy tqdm nltk scikit-learn matplotlib

Usage

1. Process Raw DARPA Data

Convert raw DARPA TC3 audit logs into the structured format (srcUUID dstUUID action target timestamp):
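The conversion script itself is not shown here; as a minimal illustration only (the `format_event` helper and field names are hypothetical, not the repo's code), one record maps to one tab-separated line:

```python
# Hypothetical sketch: normalize one raw DARPA TC audit record into the
# five-field line used downstream. Field names are illustrative.
def format_event(record):
    """Return a tab-separated line: srcUUID dstUUID action target timestamp."""
    fields = (record["src"], record["dst"], record["action"],
              record["target"], str(record["ts"]))
    return "\t".join(fields)

event = {"src": "A1", "dst": "B2", "action": "EVENT_READ",
         "target": "/etc/passwd", "ts": 1523021000}
line = format_event(event)
```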

2. Merge Sketch Data

Combine GraphChi sketch vectors with the processed logs:

cd graph_data_process
python map_sketch_raw_data.py

3. Split Train and Test

Split the merged data into training and test sets.
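No split command is given in this section. As an illustration only (the data here is a stand-in, not the repo's loader), a stratified split with scikit-learn, which is already in the requirements, could look like:

```python
from sklearn.model_selection import train_test_split

# Stand-in data: mostly benign sequences with a few attack windows.
events = list(range(100))
labels = [0] * 90 + [1] * 10

# Stratify so the rare attack class appears in both splits.
train_x, test_x, train_y, test_y = train_test_split(
    events, labels, test_size=0.2, stratify=labels, random_state=42)
```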

4. Generate Token Tree

Extract unique objects/actions and build the semantic token tree:

cd generate_token_tree
python nlp_stem.py > object_data/cadet_object.txt
python tree_from_path.py > object_data/cadet_tree.txt

5. Tokenize Logs

Tokenize the log data and resample for class balance:

cd tokenize_train_test
python token_and_resample.py --semi_data_path='../data/cadet/full_data/graph/graph_data_sampled/'

For semi-supervised learning with pseudo-labels:

python semi_token_and_resample.py --semi_data_path='../data/cadet/full_data/graph/graph_data_sampled/'

6. Train the Model

Configure the dataset and model in train/config.py:

  • dataset_dir — path to the dataset
  • train_corpus_file_paths / test_corpus_file_paths — JSON filenames
  • vocab_size — set to match the token tree size
  • model_name — name for saved checkpoints
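A hypothetical `train/config.py` fragment covering the options above (the filenames and values are illustrative; check the repo's actual config for the exact names):

```python
# Illustrative config values only.
dataset_dir = "../data/cadet/full_data/"
train_corpus_file_paths = ["train_corpus.json"]   # hypothetical filename
test_corpus_file_paths = ["test_corpus.json"]     # hypothetical filename
vocab_size = 512        # must match the token tree size
model_name = "tidal_cadet"
```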

Then train:

cd train
python train.py

7. Evaluate

Run threshold-based evaluation and generate metrics:

python test_log.py      # Threshold sweep, outputs results CSV
python test_roc.py      # ROC and precision-recall curves
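The core of a threshold sweep like the one `test_log.py` performs can be sketched as follows (toy scores, not the repo's exact logic):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Toy anomaly scores and ground-truth labels for illustration.
y_true = np.array([0, 0, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.35, 0.8, 0.9])

# Sweep thresholds and record (threshold, precision, recall) rows.
rows = []
for thr in np.arange(0.1, 1.0, 0.2):
    y_pred = (scores >= thr).astype(int)
    rows.append((round(float(thr), 1),
                 precision_score(y_true, y_pred, zero_division=0),
                 recall_score(y_true, y_pred)))
```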

Model Architecture

The classifier follows an encoder-only transformer design:

  1. Token Embedding — maps tokenized log events to dense vectors
  2. Positional Encoding — injects sequence position information
  3. Transformer Encoder — multi-head self-attention over the log sequence
  4. Pooling — sum/average over sequence length
  5. Graph Embedding Fusion — optional concatenation of node2vec embeddings
  6. Classifier Head — binary output (benign vs. attack)
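The six stages above can be sketched in PyTorch as follows. This is a minimal illustration with made-up layer sizes, not the repo's actual model:

```python
import torch
import torch.nn as nn

class LogClassifier(nn.Module):
    """Minimal sketch of the described pipeline; dimensions are illustrative."""
    def __init__(self, vocab_size=512, d_model=64, graph_dim=32, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)            # 1. token embedding
        self.pos = nn.Parameter(torch.zeros(1, 128, d_model))     # 2. learned positions
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2) # 3. self-attention
        self.head = nn.Linear(d_model + graph_dim, num_classes)   # 6. classifier head

    def forward(self, tokens, graph_emb):
        x = self.embed(tokens) + self.pos[:, : tokens.size(1)]
        x = self.encoder(x).mean(dim=1)                           # 4. average pooling
        x = torch.cat([x, graph_emb], dim=-1)                     # 5. graph fusion
        return self.head(x)

# Batch of 2 sequences of 16 tokens, plus one 32-dim graph embedding each.
logits = LogClassifier()(torch.zeros(2, 16, dtype=torch.long), torch.zeros(2, 32))
```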

Data Format

Raw log format (tab-separated):

srcUUID    dstUUID    action    target    timestamp

Training JSON ([x, y, z]):

  • x — list of token sequences (list of int lists)
  • y — binary labels (0 = benign, 1 = attack)
  • z — node2vec graph embeddings per sequence
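A toy example of the `[x, y, z]` layout (token IDs and embedding dimensions are made up for illustration):

```python
import json

# One JSON blob holding [x, y, z] with two sequences.
sample = [
    [[3, 17, 42], [3, 8, 99, 5]],    # x: token sequences (int lists)
    [0, 1],                          # y: labels (0 = benign, 1 = attack)
    [[0.1, -0.2], [0.05, 0.3]],      # z: node2vec embedding per sequence
]
blob = json.dumps(sample)
x, y, z = json.loads(blob)
```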
