Python: Validate approval responses against server-side pending request registry by moonbox3 · Pull Request #4548 · microsoft/agent-framework

moonbox3 · 2026-03-09T02:42:08Z

Motivation and Context

Adds a server-side pending approval registry to AgentFrameworkAgent that tracks approval requests emitted during streaming
Validates every incoming approval response against the registry before tool execution — rejects responses with no matching pending request, mismatched function names, or previously consumed IDs
Invalid approval responses are fully stripped from the message history so no bogus function_result content is injected into the LLM conversation
Registry keys are scoped by thread_id to prevent cross-conversation leaks

Description

Contribution Checklist

The code builds clean without any errors or warnings
The PR follows the Contribution Guidelines
All unit tests pass, and I have added new tests where possible
Is this a breaking change? If yes, add "[BREAKING]" prefix to the title of the PR.

markwallace-microsoft · 2026-03-09T02:44:30Z

Python Test Coverage Report •

File	Stmts	Miss	Cover	Missing
packages/ag-ui/agent_framework_ag_ui
_agent.py	39	1	97%	63
_agent_run.py	475	49	89%	148–155, 194–195, 202, 299, 311, 315, 317–318, 334, 361–362, 380–384, 492–494, 506–508, 612, 620, 729, 731–732, 785, 800–801, 808, 873, 896, 904, 906, 914, 967, 970, 980–981, 988
TOTAL	22725	2588	88%

Python Unit Test Overview

Tests	Skipped	Failures	Errors	Time
4943	20 💤	0 ❌	0 🔥	1m 21s ⏱️

Copilot

Pull request overview

This PR adds a server-side pending approval registry to AgentFrameworkAgent to validate approval responses before tool execution. It prevents three classes of security vulnerabilities: (1) approval bypass via crafted function_approvals without a prior approval request, (2) function name spoofing in the approval response, and (3) replay attacks using a previously consumed approval ID.

Changes:

AgentFrameworkAgent now holds a _pending_approvals dict (keyed {thread_id}:{call_id}) that is populated when the framework emits an approval request and consumed when a valid approval response is processed.
_resolve_approval_responses validates every incoming approved response against the registry; invalid entries are stripped from the message list entirely (not converted to rejections), preventing attacker-controlled content from entering the LLM conversation.
The two-turn approval test is restructured to model the real flow (Turn 1 emits the request; Turn 2 sends the response), and three new regression tests cover the bypass, replay, and name-spoofing scenarios.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File	Description
`_agent.py`	Adds `_pending_approvals: dict[str, str]` to `AgentFrameworkAgent.__init__` and passes it to `run_agent_stream`.
`_agent_run.py`	Validates approved responses against the registry in `_resolve_approval_responses`; registers approval requests in the streaming loop.
`test_agent_wrapper_comprehensive.py`	Refactors the existing two-turn approval test and adds four new security regression tests.

You can also share your feedback on Copilot code review. Take the survey.

python/packages/ag-ui/agent_framework_ag_ui/_agent_run.py

python/packages/ag-ui/agent_framework_ag_ui/_agent.py

moonbox3

Automated Code Review

Reviewers: 4 | Confidence: 83%

✓ Correctness

This diff adds server-side validation of approval responses against a pending-approvals registry to prevent bypass, function-name spoofing, and replay attacks. The core validation logic for approved responses is correct: entries are registered on emission, validated on receipt, and consumed to prevent replay. However, rejected responses are not validated against the registry, meaning fabricated rejection responses for never-issued approval requests can still inject content into the LLM conversation. There is also a potential missing json import in the test file, and the _pending_approvals dict has no eviction mechanism, risking unbounded memory growth for long-lived agent instances.

✗ Security Reliability

This diff adds server-side validation for approval responses via a pending-approvals registry, preventing bypass, replay, and function-name spoofing attacks on tool execution. The core mechanism is sound: approvals are registered on emit, validated and consumed on response, and invalid responses are stripped. However, rejected approval responses entirely bypass the registry validation (the guard clause only checks approved_responses), which allows an attacker to inject fabricated rejection results into the LLM conversation without a matching pending approval. Additionally, the _pending_approvals dict grows unboundedly if approval requests are never responded to, creating a potential memory leak for long-running servers.

✗ Test Coverage

The diff adds a server-side pending-approvals registry to prevent approval bypass, function-name spoofing, and replay attacks, with four new security-focused tests and one refactored two-turn approval test. Test coverage for the core security scenarios (bypass, replay, name mismatch, fabricated history) is strong. However, the fabricated-tool-result test uses json.dumps without evidence that json is imported in the test module, which would cause a runtime failure. Additionally, the rejection path when pending_approvals is active has no test coverage, and there is no test for the edge case where an approval request has a missing id or function_call (causing silent registration skip).

✓ Design Approach

The core design — a server-side pending-approvals registry that prevents bypass, replay, and function-name spoofing — is sound and well-tested. However, the validation gate only covers approved responses (if pending_approvals is not None and approved_responses), leaving fabricated rejected responses completely unvalidated. This is inconsistent with the stated security goal of preventing attacker-controlled content injection into the LLM conversation, since a fabricated rejection also injects a FunctionResultContent into the LLM context. Additionally, the in-memory _pending_approvals dict has no eviction mechanism, so unanswered approval requests accumulate indefinitely in long-running instances. The github-copilot-sdk upper-bound pin (<0.1.32) looks like a symptom-level compatibility workaround rather than a proper fix.

Flagged Issues

Rejected approval responses are not validated against the pending-approvals registry. The validation block at line 408 only runs when approved_responses is non-empty (if pending_approvals is not None and approved_responses:), so fabricated rejection responses with no matching pending approval pass through unvalidated. An attacker can inject fake rejection results into the LLM conversation, potentially influencing model behavior (e.g., making it believe certain tools are unavailable).
test_approval_bypass_via_fabricated_tool_result_is_blocked uses json.dumps(...) but import json is not visible in the test file's existing imports or in the diff. If it is not already imported, this test will fail with a NameError at runtime.

Suggestions

Rejected responses (not resp.approved) bypass the pending_approvals validation entirely. A crafted rejected response for a never-issued approval request would still be processed and inject a 'tool was rejected' FunctionResultContent into the LLM conversation. Consider validating rejected responses against the registry as well, or at minimum stripping them if their ID is not in pending_approvals.
The _pending_approvals dict on AgentFrameworkAgent has no eviction or TTL mechanism. If approval requests are emitted but the client never responds (e.g., user abandons the session), entries accumulate indefinitely. Consider adding a bounded size or TTL-based cleanup.
Verify that import json exists at the top of test_agent_wrapper_comprehensive.py — the new test test_approval_bypass_via_fabricated_tool_result_is_blocked uses json.dumps but the import is not visible in the diff.
The _pending_approvals dict has no TTL or cleanup mechanism. If an approval request is emitted but the user never responds (abandons the conversation), entries accumulate indefinitely. Consider adding a bounded size or expiration to prevent memory leaks in long-running server deployments.
When content.function_call.name is None at line 844, the registry stores an empty string as the function name. A crafted approval response with no function name would then pass the name-match check. Consider treating a missing function name as an error and not registering the approval.
Direct callers of run_agent_stream (not going through AgentFrameworkAgent) get pending_approvals=None by default, which disables all validation. Consider documenting this clearly or making validation opt-out rather than opt-in.
Add a test verifying that a legitimate rejection response still works correctly when pending_approvals validation is active. The current validation block (if pending_approvals is not None and approved_responses) skips rejected responses entirely, so rejected tool calls bypass registry validation. While rejections don't execute tools, confirming this behavior is preserved under the new code path would strengthen coverage.
Add a test for the edge case where a function_approval_request content has content.id as None or content.function_call as None. The registration guard (if content.id and content.function_call) silently skips registration, meaning a subsequent legitimate approval for that call would be rejected. This silent failure deserves test coverage.
In test_approval_function_name_mismatch_is_blocked, assert that the pending approval entry for 'call_safe_001' remains in wrapper._pending_approvals after the spoofed attempt is rejected. This confirms the legitimate user can still approve the original action with the correct function name.
Validate rejected approval responses against the pending_approvals registry as well. The code comment explicitly states that injecting attacker-controlled content into the LLM conversation is a concern, but fabricated rejections bypass validation entirely and still inject a FunctionResultContent (with rejection text) into the LLM context. A crafted rejection could manipulate the model's strategy without any prior approval request having been issued.
Consider adding a bounded-size or TTL-based eviction policy for _pending_approvals. If a user triggers approval requests but never responds, entries accumulate without bound for the lifetime of the AgentFrameworkAgent instance.
The github-copilot-sdk>=0.1.0,<0.1.32 pin in pyproject.toml looks like a symptom-level workaround. If 0.1.32 introduced a breaking change, the code should be updated for compatibility rather than capping the version, which blocks users from receiving future SDK bug fixes and security patches.

Automated review by moonbox3's agents

python/packages/ag-ui/agent_framework_ag_ui/_agent_run.py

python/packages/ag-ui/agent_framework_ag_ui/_agent.py

python/packages/ag-ui/agent_framework_ag_ui/_agent_run.py

python/packages/ag-ui/agent_framework_ag_ui/_agent.py

python/packages/ag-ui/tests/ag_ui/test_agent_wrapper_comprehensive.py

python/packages/ag-ui/agent_framework_ag_ui/_agent_run.py

python/packages/ag-ui/agent_framework_ag_ui/_agent.py

python/packages/github_copilot/pyproject.toml

- Validate rejected approval responses against pending_approvals registry, not just approved ones. Fabricated rejections without a prior request are now stripped from messages before reaching the LLM. - Bound _pending_approvals with OrderedDict + LRU eviction (max 10k) to prevent unbounded memory growth from abandoned approval requests. - Skip registration when function_call.name is None/empty; log warning when content.id or function_call is missing at registration time. - Document pending_approvals parameter in run_agent_stream docstring. - Add test for fabricated rejection attack scenario. - Assert pending approval entry is preserved after function name mismatch. - Pre-populate pending_approvals in rejection test for correct validation. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

moonbox3 added 2 commits March 9, 2026 11:35

Validate approval responses against server-side pending request registry

d1518d9

improvements

fd7e8ee

Copilot AI review requested due to automatic review settings March 9, 2026 02:42

markwallace-microsoft added the python label Mar 9, 2026

Copilot started reviewing on behalf of moonbox3 March 9, 2026 02:42 View session

Copilot AI reviewed Mar 9, 2026

View reviewed changes

python/packages/ag-ui/agent_framework_ag_ui/_agent_run.py Outdated Show resolved Hide resolved

python/packages/ag-ui/agent_framework_ag_ui/_agent_run.py Show resolved Hide resolved

python/packages/ag-ui/agent_framework_ag_ui/_agent.py Outdated Show resolved Hide resolved

pin GHCP sdk version to non-breaking for now

94c6bb0

moonbox3 commented Mar 9, 2026

View reviewed changes

moonbox3 and others added 4 commits March 9, 2026 11:58

Pin CHCP sdk to LKG.

c1b4f6a

really fix GHCP sdk pkg version

d2ba4d0

Apply pre-commit auto-fixes

c2e8797

eavanvalkenburg approved these changes Mar 9, 2026

View reviewed changes

moonbox3 and others added 2 commits March 9, 2026 19:17

merge main

ebfbea4

Merge branch 'main' into ag-ui-hitl-improve

a671f21

moonbox3 requested a review from TaoChenOSU March 10, 2026 22:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Python: Validate approval responses against server-side pending request registry#4548

Python: Validate approval responses against server-side pending request registry#4548
moonbox3 wants to merge 9 commits intomicrosoft:mainfrom
moonbox3:ag-ui-hitl-improve

moonbox3 commented Mar 9, 2026

Uh oh!

markwallace-microsoft commented Mar 9, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

moonbox3 left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

moonbox3 commented Mar 9, 2026

Motivation and Context

Description

Contribution Checklist

Uh oh!

markwallace-microsoft commented Mar 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Python Unit Test Overview

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

moonbox3 left a comment

Choose a reason for hiding this comment

Automated Code Review

✓ Correctness

✗ Security Reliability

✗ Test Coverage

✓ Design Approach

Flagged Issues

Suggestions

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

markwallace-microsoft commented Mar 9, 2026 •

edited

Loading