A sophisticated cybersecurity tool designed to identify supply-chain attacks in NPM and PyPI ecosystems using static analysis, behavior sequences, and vector-based retrieval.
This tool moves beyond simple regex-based heuristic scanners by:
- Parsing source code into AST (Abstract Syntax Tree).
- Extracting high-level behavior sequences.
- Comparing behaviors against known malicious patterns using RAG (Retrieval-Augmented Generation) with vector similarity.
- Reasoning about the risk using a Simulated LLM that generates high-fidelity explanations.
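The first two stages (AST parsing and behavior extraction) can be sketched with the stdlib `ast` module. The function name and behavior labels below are illustrative, not the tool's actual API:

```python
# Minimal sketch of AST-based behavior extraction (illustrative, not the
# tool's real implementation). Sensitive calls are mapped to behavior labels.
import ast

def extract_behaviors(source: str) -> list[str]:
    """Walk the AST and map sensitive API calls to high-level behaviors."""
    behaviors = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = ast.unparse(node.func)
            if name == "os.system":
                behaviors.append("SHELL_EXECUTION")
            elif name.startswith("requests."):
                behaviors.append("NETWORK_REQUEST")
            elif name.startswith("base64."):
                behaviors.append("DATA_ENCODING")
    return behaviors

sample = "import os, requests\nos.system('whoami')\nrequests.post('http://x', data=1)"
print(extract_behaviors(sample))  # -> ['SHELL_EXECUTION', 'NETWORK_REQUEST']
```

Working on the AST rather than raw text is what lets the tool see past superficial formatting tricks that defeat regex scanners.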
- parser/: AST parsing for Python/JS and behavioral extraction logic.
- rag/: Vector database and known malicious pattern storage.
- llm/: Simulated reasoning engine for risk scoring and explanations.
- detector/: Core orchestration engine.
- utils/: Registry downloaders and file system helpers.
- main.py: Interactive CLI with rich terminal formatting.
Environment (Python 3.10+):

```shell
pip install -r requirements.txt
```
Model Download: if `all-MiniLM-L6-v2` is not already cached locally, it is downloaded automatically from HuggingFace on the first run; no manual step is required.
```shell
# Scan packages fetched from a registry
python main.py express --registry npm
python main.py requests --registry pypi

# Scan a simulated malicious example
python main.py --local ./tests/malicious_example

# Scan a simulated safe example
python main.py --local ./tests/safe_example

# Run the API server
uvicorn server:app --reload
```

- AST Extraction: The tool parses `.py` and `.js` files, looking for sensitive API calls such as `os.system`, `fetch`, `base64.decode`, and `process.env`.
- Behavior Mapping: Raw tokens are converted into high-level behaviors:
  - `CALL_OS.SYSTEM` -> `SHELL_EXECUTION`
  - `CALL_REQUESTS.POST` + `CALL_ENVIRON` -> `EXFILTRATION_RISK`
- Vector RAG: These sequences are vectorized and compared against `rag/patterns.json` using cosine similarity.
- Simulated LLM Reasoning: The analyzer evaluates the combination of behaviors. For example, a network call alone is fine, but a network call combined with environment variable access and base64 encoding triggers a MALICIOUS verdict.
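The RAG matching step can be sketched as follows. A simple hashed bag-of-behaviors vector stands in for the real `all-MiniLM-L6-v2` embeddings, and the pattern dictionary is an illustrative stand-in for `rag/patterns.json`:

```python
# Sketch of vector matching against known malicious patterns (assumptions:
# hashed bag-of-behaviors embedding instead of the real sentence embeddings,
# and an inline pattern dict instead of rag/patterns.json).
import math

def embed(behaviors: list[str], dim: int = 32) -> list[float]:
    vec = [0.0] * dim
    for b in behaviors:
        vec[hash(b) % dim] += 1.0  # bucket each behavior into a fixed-size vector
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

patterns = {
    "Reverse Shell": ["SHELL_EXECUTION", "NETWORK_REQUEST"],
    "Data Exfiltration": ["ENV_VARIABLE_ACCESS", "DATA_ENCODING", "NETWORK_REQUEST"],
}
observed = ["SHELL_EXECUTION", "NETWORK_REQUEST"]
best = max(patterns, key=lambda t: cosine(embed(observed), embed(patterns[t])))
print(best)  # -> Reverse Shell
```

The design point is that similarity is computed over *behavior sequences*, not source text, so renamed variables or reordered code still match known threat patterns.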
```json
{
  "package_name": "malicious-pkg",
  "registry": "npm",
  "behaviors": ["IMPORT_OS", "CALL_SUBPROCESS.RUN", "NETWORK_REQUEST"],
  "behavior_description": "This code imports sensitive modules and executes shell commands...",
  "rag_match": {
    "pattern": { "threat": "Reverse Shell", "description": "Spawns a remote shell..." },
    "score": 0.85
  },
  "analysis": {
    "verdict": "MALICIOUS",
    "score": 95,
    "reasoning": "AI ANALYSIS REPORT: ...",
    "confidence": "High",
    "indicators": [["SHELL_EXECUTION", 45], ["NETWORK_REQUEST", 20]]
  }
}
```

- Detected Behaviors: SHELL_EXECUTION, NETWORK_REQUEST, ENV_VARIABLE_ACCESS, DATA_ENCODING.
- RAG Match: "Data Exfiltration (Environment Variables)".
- Verdict: MALICIOUS (Score: 85+)
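A scoring rule consistent with the sample report above might sum weighted indicators and map the total to a verdict. The weights and thresholds here are assumptions for illustration, not the tool's actual values:

```python
# Hypothetical verdict scoring: indicator weights are summed and capped at 100,
# then bucketed into a verdict. Weights/thresholds are illustrative assumptions.
WEIGHTS = {
    "SHELL_EXECUTION": 45,
    "NETWORK_REQUEST": 20,
    "ENV_VARIABLE_ACCESS": 25,
    "DATA_ENCODING": 15,
}

def score(behaviors: list[str]) -> tuple[int, str]:
    total = min(100, sum(WEIGHTS.get(b, 0) for b in behaviors))
    if total >= 80:
        verdict = "MALICIOUS"
    elif total >= 40:
        verdict = "SUSPICIOUS"
    else:
        verdict = "SAFE"
    return total, verdict

print(score(["SHELL_EXECUTION", "NETWORK_REQUEST", "ENV_VARIABLE_ACCESS"]))
# -> (90, 'MALICIOUS')
```

Additive weighting captures the key idea from the reasoning stage: individually benign behaviors (a lone network call) stay below the threshold, while suspicious *combinations* push the score into the MALICIOUS band.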
- Obfuscation: Advanced malware may use dynamic code generation (e.g. `eval` on a base64-decoded string) to hide itself.
- Contextual Analysis: Some legitimate DevOps tools (such as the AWS SDK) exhibit similar behaviors; they require higher confidence thresholds.
- Future: Support for Rust/C++ extensions and dynamic sandboxing.
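The eval-on-base64 obfuscation pattern noted above can at least be flagged statically with an AST check; the rule below is a simple sketch, not the tool's detector:

```python
# Illustrative check for the eval-on-base64 obfuscation pattern: flag any
# eval(...) call whose argument expression mentions base64.
import ast

def has_eval_on_base64(source: str) -> bool:
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == "eval"
                and any("base64" in ast.unparse(arg) for arg in node.args)):
            return True
    return False

payload = "import base64\neval(base64.b64decode('cHJpbnQoMSk='))"
print(has_eval_on_base64(payload))        # -> True
print(has_eval_on_base64("print('hi')"))  # -> False
```

A static rule like this still misses payloads built indirectly (e.g. the decoded string assigned to a variable first), which is why dynamic sandboxing remains on the roadmap.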