Build the evaluation harness, scorecards, and policy-aware audit plus redaction flow #67

@KSemenenko

Description

Problem

Operators need a usable trust layer that combines transcript evaluation, readable scorecards, and safe audit exports with redaction control.

Scope

  • Track evaluation harness inputs, transcript scoring, scorecard history, audit exports, and redaction policy behavior
  • Cover how quality and trust state appears per agent profile or session
  • Keep sensitive data handling explicit
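
As a rough illustration of how scorecard history might be tracked per agent profile or session, here is a minimal sketch. Everything in it is an assumption made for illustration: `Scorecard`, `ScorecardHistory`, and the metric names are hypothetical and are not taken from the feature spec or from any evaluation package.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Scorecard:
    """One evaluation result for one session (hypothetical shape)."""
    session_id: str
    agent_profile: str
    scores: dict  # metric name -> score; metric names are placeholders


class ScorecardHistory:
    """Keeps scorecards in insertion order, grouped by agent profile."""

    def __init__(self):
        self._by_profile = defaultdict(list)

    def record(self, card):
        self._by_profile[card.agent_profile].append(card)

    def latest(self, agent_profile):
        # Most recently recorded scorecard, or None if the profile has none.
        cards = self._by_profile.get(agent_profile)
        return cards[-1] if cards else None

    def mean_score(self, agent_profile, metric):
        # Average of one metric across the profile's history, or None.
        cards = self._by_profile.get(agent_profile, [])
        values = [c.scores[metric] for c in cards if metric in c.scores]
        return sum(values) / len(values) if values else None
```

Grouping by profile while keeping the session identifier on each scorecard lets an operator view either the latest state of a session or a trend across a profile; the real product may well choose a different shape.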

Out of scope

  • Low-level telemetry emission that belongs in earlier observability issues
  • Toolchain installation and provider health flows

Implementation notes

  • Make scorecards and audit exports actionable to operators
  • Align redaction with provider secrets, prompt data, tool arguments, and results
  • Keep the issue compatible with the official evaluation packages
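
One possible reading of the redaction note above is a per-category policy applied to each audit record before export. The sketch below is illustrative only: the four category keys mirror the ones listed in this issue, while `RedactionAction`, `DEFAULT_POLICY`, `MASK_PLACEHOLDER`, and `redact_record` are hypothetical names, and the default actions are assumptions rather than decided behavior.

```python
from enum import Enum


class RedactionAction(Enum):
    KEEP = "keep"  # export the value unchanged
    MASK = "mask"  # replace the value with a fixed placeholder
    DROP = "drop"  # omit the field from the export entirely


# Hypothetical defaults: one action per category named in the issue.
DEFAULT_POLICY = {
    "provider_secret": RedactionAction.DROP,
    "prompt_data": RedactionAction.MASK,
    "tool_arguments": RedactionAction.MASK,
    "tool_results": RedactionAction.KEEP,
}

MASK_PLACEHOLDER = "[REDACTED]"


def redact_record(record, policy=DEFAULT_POLICY):
    """Return a redacted copy of an audit record.

    `record` maps a category name to its value. Categories missing from
    the policy are masked, so unclassified data fails closed.
    """
    redacted = {}
    for category, value in record.items():
        action = policy.get(category, RedactionAction.MASK)
        if action is RedactionAction.DROP:
            continue
        if action is RedactionAction.MASK:
            redacted[category] = MASK_PLACEHOLDER
        else:
            redacted[category] = value
    return redacted
```

Masking unknown categories by default keeps sensitive data handling explicit: a new field has to be classified in the policy before its value can appear in an export.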

Definition of Done

  • The issue defines the minimum trust surfaces: harness, scorecards, audit, and redaction
  • Later implementation can proceed without re-deciding how evaluations surface in the product

Verification

  • Review the issue against the feature spec, evaluation-package issue, and approval flows

Dependencies

Metadata

    Labels

    enhancement (New feature or request)
