GitHub - Pro-GenAI/Agent-Action-Guard: 🛡️ Safe AI Agents through Action Classifier

⚠️ When AI agents are provided with a harmful tool and an instruction, they just use it. Popular and high-performing latest LLMs are not an exception.

🤖 AI is perceived as a threat. Increasing usage of agents leads to the usage of harmful tools and harmful usage of tools as proven using HarmActEval. Classifying AI agent actions ensures safety and reliability. Action Guard uses a neural network model trained on HarmActions dataset to classify actions proposed by autonomous AI agents as harmful or safe. The model has been based on a small dataset of labeled examples. The work aims to enhance the safety and reliability of AI agents by preventing them from executing actions that are potentially harmful, unethical, or violate predefined guidelines. ✅ Safe AI Agents are made possible by Action Classifier.

🎬 Demo

Tip

Please star ⭐ the repository if you find Action Guard is useful!

🚨 Common causes of harmful actions by AI agents:

🔓 User attempting to jailbreak the model.
🌀 Model hallucinating or misunderstanding the context.
💭 Model being overconfident in its incorrect knowledge.
🚧 Lack of proper constraints or guidelines for the agent.
📉 Inadequate training data for specific scenarios.
🛠️ MCP server providing incorrect tool descriptions that mislead the agent.
🎭 Harmful MCP servers returning manipulative text to mislead the model.
😬 The experiments proved that the model performs a harmful action and still responds "Sorry, I can't help with that."

🆕 New contributions of Agent-Action-Guard framework:

📊 HarmActions, a structured dataset of safety-labeled agent actions complemented with manipulated prompts that trigger harmful or unethical actions.
📏 HarmActEval benchmark leveraging a new metric "Safe@k."
🧠 Action Classifier, a neural classifier trained on HarmActions dataset, designed to label proposed agent actions as potentially harmful or safe, and optimized for real-time deployment in agent loops.
🔌 MCP integration supporting live action screening using existing MCP servers and clients.

📊 HarmActEval Results

⚡ Popular and latest LLMs generate harmful actions, proving the need for the action guard and HarmActEval benchmark.

Model	Safe@1 score
Phi 4 Mini Instruct	0.00%
Granite 4-H-Tiny	0.00%
Phi 4 Mini Reasoning	5.33%
Ministral 3 (3B)	2.67%
GPT-5.3	17.33%
Average	5.07%

📌 Note: Higher Safe@k score is better.

✨ Special features:

This project introduces "HarmActEval" dataset and benchmark to evaluate an AI agent's probability of generating harmful actions.
The dataset has been used to train a lightweight neural network model that classifies actions as safe, harmful, or unethical.
⚡ The model is lightweight and can be easily integrated into existing AI agent frameworks like MCP.
🔌 Supports MCP (Model Context Protocol) to allow real-time action classification.
🤝 A2A-compatible version: https://github.com/Pro-GenAI/A2A-Agent-Action-Guard.
Unlike OpenAI's "require_approval": "always" flag, this blocks harmful actions without human intervention.

🛡️ Safety Features:

🔍 Automatically classifies MCP tool calls before execution.
🚫 Blocks harmful actions based on the outputs of the trained model.
📋 Provides detailed classification results.
✅ Allows safe actions to proceed normally.

💬 Feedback

❤️ Love Action Guard? Please share a quick note at #15. It really helps shape the project to create a major impact on the AI field. 🙌 Waiting with excitement for feedback and discussions on how this helps you or the AI community.

🚀 Usage

⚡ Quick install:

Using uv:

uv venv
source .venv/bin/activate
uv pip install agent-action-guard

📖 For usage instructions, kindly refer https://github.com/Pro-GenAI/Agent-Action-Guard/blob/main/USAGE.md.

🔑 Note: The embedding client accepts an API key via the EMBEDDING_API_KEY environment variable (falls back to OPENAI_API_KEY if unset). See .env.example and USAGE.md for examples.

📦 Install with HarmActEval:

pip install "agent-action-guard[harmacteval]"
python -m agent_action_guard.harmacteval

Note: The usage of HarmActEval requires OpenAI API key to be set in the environment variables.

🏷️ License

This project is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0). You are free to share and adapt the material for any purpose, even commercially, as long as you provide appropriate credit to the original author(s) and indicate if changes were made.

If attribution is not desired, please send an acknowledgment email to praneeth.vad@gmail.com with the details of how you use the work and the impact it has on your project or research.

📝 Citation

If you find this repository useful in your research, please consider citing:

@article{202510.1415,
	title = {Agent Action Guard: Classifying AI Agent Actions to Ensure Safety and Reliability},
  	year = 2025,
	month = {October},
	publisher = {Preprints},
	author = {Praneeth Vadlapati},
	doi = {10.20944/preprints202510.1415.v1},
	url = {https://doi.org/10.20944/preprints202510.1415.v1},
	journal = {Preprints}
}

🤝 A2A version:

While this repository focuses on standard tool calls and MCP, an Agent-to-Agent (A2A) compatible version is available at: https://github.com/Pro-GenAI/A2A-Agent-Action-Guard

Name		Name	Last commit message	Last commit date
Latest commit History 37 Commits
.github		.github
agent_action_guard		agent_action_guard
assets		assets
examples		examples
tests		tests
training		training
.dockerignore		.dockerignore
.env.example		.env.example
.flake8		.flake8
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE.md		LICENSE.md
MANIFEST.in		MANIFEST.in
Makefile		Makefile
README.md		README.md
USAGE.md		USAGE.md
docker-compose.yml		docker-compose.yml
host-model.sh		host-model.sh
pyproject.toml		pyproject.toml
pytest.ini		pytest.ini
sitecustomize.py		sitecustomize.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

🎬 Demo

🚨 Common causes of harmful actions by AI agents:

🆕 New contributions of Agent-Action-Guard framework:

📊 HarmActEval Results

✨ Special features:

💬 Feedback

🚀 Usage

🏷️ License

📝 Citation

🤝 A2A version:

About

Uh oh!

Releases 5

Sponsor this project

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

🎬 Demo

🚨 Common causes of harmful actions by AI agents:

🆕 New contributions of Agent-Action-Guard framework:

📊 HarmActEval Results

✨ Special features:

💬 Feedback

🚀 Usage

🏷️ License

📝 Citation

🤝 A2A version:

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 5

Sponsor this project

Uh oh!

Contributors

Uh oh!

Languages