- Python 3.8+
- PyTorch >= 1.10.0 (CUDA recommended)
- pandas
- numpy
- tqdm
- nltk
- scikit-learn
- matplotlib
Install all dependencies:
```bash
pip install torch pandas numpy tqdm nltk scikit-learn matplotlib
```

Convert raw DARPA TC3 audit logs into the structured format (`srcUUID dstUUID action target timestamp`):
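The conversion script itself isn't listed here; a minimal sketch of the transformation, assuming hypothetical input field names (not the project's actual parser), could look like:

```python
# Hypothetical sketch: emit one raw audit record in the
# "srcUUID dstUUID action target timestamp" tab-separated format.
# Field names and values below are illustrative assumptions.
def to_structured(record: dict) -> str:
    fields = [
        record["src_uuid"],
        record["dst_uuid"],
        record["action"],
        record["target"],
        str(record["timestamp"]),
    ]
    return "\t".join(fields)

line = to_structured({
    "src_uuid": "AAAA-1111",
    "dst_uuid": "BBBB-2222",
    "action": "EVENT_READ",
    "target": "/etc/passwd",
    "timestamp": 1523630400,
})
```

Each output line then matches the five-column layout expected by the later processing steps.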
Combine GraphChi sketch vectors with the processed logs:
```bash
cd graph_data_process
python map_sketch_raw_data.py
```

Extract unique objects/actions and build the semantic token tree:
```bash
cd generate_token_tree
python nlp_stem.py > object_data/cadet_object.txt
python tree_from_path.py > object_data/cadet_tree.txt
```

Tokenize the log data and resample for class balance:
```bash
cd tokenize_train_test
python token_and_resample.py --semi_data_path='../data/cadet/full_data/graph/graph_data_sampled/'
```

For semi-supervised learning with pseudo-labels:
```bash
python semi_token_and_resample.py --semi_data_path='../data/cadet/full_data/graph/graph_data_sampled/'
```

Configure the dataset and model in `train/config.py`:
- `dataset_dir` — path to the dataset
- `train_corpus_file_paths` / `test_corpus_file_paths` — JSON filenames
- `vocab_size` — set to match the token tree size
- `model_name` — name for saved checkpoints
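As an illustration, a `train/config.py` with these entries might look like the following; all values are placeholders, not the repository's defaults:

```python
# Illustrative config values only -- adjust paths and sizes to your setup.
dataset_dir = "../data/cadet/full_data/graph/graph_data_sampled/"
train_corpus_file_paths = ["train_corpus.json"]
test_corpus_file_paths = ["test_corpus.json"]
vocab_size = 5000               # must match the generated token tree size
model_name = "cadet_transformer"  # checkpoint files are saved under this name
```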
Then train:
```bash
cd train
python train.py
```

Run threshold-based evaluation and generate metrics:
```bash
python test_log.py   # Threshold sweep, outputs results CSV
python test_roc.py   # ROC and precision-recall curves
```

The classifier follows an encoder-only transformer design:
- Token Embedding — maps tokenized log events to dense vectors
- Positional Encoding — injects sequence position information
- Transformer Encoder — multi-head self-attention over the log sequence
- Pooling — sum/average over sequence length
- Graph Embedding Fusion — optional concatenation of node2vec embeddings
- Classifier Head — binary output (benign vs. attack)
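The components above can be sketched in PyTorch. This is an illustrative reconstruction, not the repository's actual model; layer sizes, average pooling, and the concatenation-based fusion are assumptions:

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Standard sinusoidal positional encoding."""
    def __init__(self, d_model: int, max_len: int = 512):
        super().__init__()
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, x):  # x: (batch, seq, d_model)
        return x + self.pe[: x.size(1)]

class LogClassifier(nn.Module):
    def __init__(self, vocab_size=5000, d_model=64, nhead=4,
                 num_layers=2, graph_dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)   # token embedding
        self.pos = PositionalEncoding(d_model)           # positional encoding
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model + graph_dim, 2)    # benign vs. attack

    def forward(self, tokens, graph_emb):
        h = self.encoder(self.pos(self.embed(tokens)))   # self-attention over sequence
        pooled = h.mean(dim=1)                           # average pooling over length
        fused = torch.cat([pooled, graph_emb], dim=-1)   # graph embedding fusion
        return self.head(fused)

model = LogClassifier()
logits = model(torch.randint(0, 5000, (2, 16)), torch.randn(2, 32))  # (2, 2) logits
```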
Raw log format (tab-separated):
```
srcUUID	dstUUID	action	target	timestamp
```
Training JSON ([x, y, z]):
- `x` — list of token sequences (list of int lists)
- `y` — binary labels (`0` = benign, `1` = attack)
- `z` — node2vec graph embeddings per sequence
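A tiny illustrative example of the `[x, y, z]` layout (all values are made up, and the embedding dimension here is arbitrary):

```python
import json

# Three parallel lists: token sequences, binary labels, graph embeddings.
x = [[12, 7, 93, 4], [5, 5, 81]]          # tokenized log event sequences
y = [0, 1]                                 # 0 = benign, 1 = attack
z = [[0.1, -0.4, 0.7], [0.0, 0.2, -0.3]]   # node2vec embedding per sequence

payload = json.dumps([x, y, z])            # serialize in [x, y, z] order
x2, y2, z2 = json.loads(payload)           # round-trips back to the three lists
```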