Reports

The Reports surface is the per-run breakdown: what happened in a single BulkSimulationReport, which simulations passed, which failed, what each evaluator scored, and how those results map back to the test cases in the dataset.

Two pages: the Reports list (every historical run) and the Individual Report (one run's detail).

Reports list

[Screenshot: AI Trust Centre → Run history (the Reports list). The header shows total runs, pass rate, failed runs, and average duration. Runs are grouped by day with columns for Run name, Environment (Sandbox/Staging), Triggered by, Triggered at, Duration, and Pass/Fail. A per-row pass bar shows the proportion of passing simulations.]

Open from AI Trust Centre → Reports. Shows a table of every completed run:

Column | Notes
Run name | Auto-generated from the dataset name + timestamp, editable.
Dataset(s) | Which dataset(s) the run executed.
Started at | When the run was queued.
Duration | End-to-end wall-clock time.
Pass / Fail / Total | Simulation counts.
Trust Score | The score persisted at run completion (with its formula version).
Snapshot | The bot config version this run was executed against (links to the Version History entry).
Status | running / completed / failed / cancelled.

Click any row to open the Individual Report.

Individual Report

[Screenshot: AI Trust Centre → Individual Report page. The header shows run name, total/passed/failed counts, pass rate, and a Retry failed shortcut alongside the Export report and Fix with Nexus actions. The simulation row exposes Evaluators (with per-evaluator pass/fail badges and thresholds) and an Analyse conversation tab. The AI SUMMARY section calls out Strengths and Areas for improvement in plain English.]

A multi-pane layout for one run:

Pane | What's there
Sidebar | Filter by pass/fail, by agent, by evaluator-failed-most, by category. Saved filter sets.
Filter | The active filter chip stack: what's currently narrowing the simulation list.
Simulation Rules | The evaluator + rule configuration that was active when the run executed. Captured at run time so the report is stable even if Evaluators & Rules are later re-tuned.
Evaluation Rules | Per-evaluator threshold values used by the run.
Simulations | One row per test case in the run: pass/fail badge, per-evaluator scores, latency, cost, link to the Test Case detail with the run-specific trace.
Saved settings | Persist a filter + view configuration as a saved view (e.g. "Quality-only failures, billing agent"). Handy for recurring review sessions.

💡 Try Copilot Nexus: "Compare this run to yesterday's. Which test cases flipped from pass to fail, and what changed in the bot config between the two snapshots?"

How a Report becomes Issues

When a run completes, the issue-generation pipeline walks the simulations and clusters failures into Issues (surfaced in the Action Center once its v1 ships). The clustering today is dumb-and-predictable: failures are grouped by the key (category × failureMode × component), so the same failure pattern across N simulations becomes one Issue with frequency: N.
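As an illustration of that grouping key, here is a minimal Python sketch. It is not the pipeline's actual implementation: the `FailedSimulation` record, its field names, and the sample values are all assumptions made for the example. Only the three-part key and the `frequency` count come from the description above.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FailedSimulation:
    # Hypothetical record shape; only the three key fields are named in the docs.
    simulation_id: str
    category: str
    failure_mode: str
    component: str


def cluster_failures(failures):
    """Group failed simulations by (category, failure_mode, component).

    Returns one dict per distinct key, most frequent first.
    """
    counts = Counter((f.category, f.failure_mode, f.component) for f in failures)
    return [
        {"category": cat, "failureMode": mode, "component": comp, "frequency": n}
        for (cat, mode, comp), n in counts.most_common()
    ]


# Illustrative data only.
failures = [
    FailedSimulation("sim-1", "Quality", "hallucination", "billing-agent"),
    FailedSimulation("sim-2", "Quality", "hallucination", "billing-agent"),
    FailedSimulation("sim-3", "Safety", "pii-leak", "support-agent"),
]
issues = cluster_failures(failures)
for issue in issues:
    print(issue)
```

Here the two identical Quality failures collapse into one Issue with frequency 2, while the single Safety failure becomes its own Issue with frequency 1.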

Reports never lose detail — even after failures have been rolled up into Issues, the per-simulation trace stays here under the simulation row.

Saved settings

The Saved settings modal on the Individual Report lets you save a filter + sort configuration under a name. Use this when you have a recurring review session, e.g. "Friday afternoon: just look at Quality failures on the billing agent". Saved settings are per-user and persist across runs: they re-apply against whatever run you open next.
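One way to picture a saved setting is as a small, run-independent record of filter and sort choices that is evaluated against whichever run is open. The sketch below is a conceptual model only; `SavedSetting`, `apply_setting`, and the simulation field names (`status`, `category`, `agent`, `latency_ms`) are hypothetical and do not describe the product's storage format or API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SavedSetting:
    # Hypothetical shape: None means "do not filter on this field".
    name: str
    status: Optional[str] = None
    category: Optional[str] = None
    agent: Optional[str] = None
    sort_by: str = "latency_ms"
    descending: bool = True


def apply_setting(setting, simulations):
    """Filter and sort one run's simulation rows according to a saved setting."""
    wanted = {
        "status": setting.status,
        "category": setting.category,
        "agent": setting.agent,
    }

    def matches(sim):
        return all(value is None or sim[key] == value for key, value in wanted.items())

    return sorted(
        (sim for sim in simulations if matches(sim)),
        key=lambda sim: sim[setting.sort_by],
        reverse=setting.descending,
    )


# Illustrative rows for one run.
run_a = [
    {"id": "sim-1", "status": "fail", "category": "Quality", "agent": "billing", "latency_ms": 900},
    {"id": "sim-2", "status": "pass", "category": "Quality", "agent": "billing", "latency_ms": 400},
    {"id": "sim-3", "status": "fail", "category": "Quality", "agent": "billing", "latency_ms": 1500},
    {"id": "sim-4", "status": "fail", "category": "Safety", "agent": "support", "latency_ms": 700},
]

friday = SavedSetting(
    name="Quality failures, billing agent",
    status="fail",
    category="Quality",
    agent="billing",
)
view = apply_setting(friday, run_a)
print([sim["id"] for sim in view])
```

Because the setting holds no run identifier, the same `friday` object can be applied unchanged to the next run's rows, which mirrors the re-apply behaviour described above.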

Best practices

  • Start from the Individual Report when investigating a regression. Reports tell you exactly what happened in a single run — the fastest way in when you're chasing a specific pass→fail flip. (Once the Action Center ships, it complements this with the longer view of what's been broken across runs.)
  • Use Saved settings for recurring reviews. A team that triages once a week saves the "Critical + High failures, my agents" filter and uses it every Friday.
  • Don't compare runs that used different snapshots. A pass-rate drop between runs is only attributable to bot changes when both runs share a snapshot.
  • Re-run individual simulations from the row. No need to re-run the whole dataset to verify a single fix — the simulation row has a Re-run shortcut that queues just that test case.
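
When chasing a pass→fail flip, the underlying comparison is a simple set difference between two runs' results. The sketch below assumes each run has been reduced to a mapping from test case ID to a pass/fail boolean (for instance from an exported report); the function name, the mapping shape, and the IDs are hypothetical and not part of the product.

```python
def pass_to_fail_flips(previous, current):
    """Return IDs of test cases that passed in `previous` and failed in `current`.

    Test cases missing from `previous` are ignored: a new failing case
    is not a flip, because it has no earlier passing result.
    """
    return sorted(
        case_id
        for case_id, passed in current.items()
        if not passed and previous.get(case_id) is True
    )


# Illustrative results only.
previous_run = {"tc-001": True, "tc-002": True, "tc-003": False}
current_run = {"tc-001": True, "tc-002": False, "tc-003": False, "tc-004": False}

flips = pass_to_fail_flips(previous_run, current_run)
print(flips)
```

In this example only `tc-002` is reported: `tc-003` was already failing and `tc-004` did not exist in the earlier run. Each reported ID is a candidate for the single-simulation Re-run shortcut once a fix is in place.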

💡 Try Copilot Nexus: "Which agents had the most regressions in this run? Group failures by agent and tell me which one to look at first."

Read next: Test Case — drill into a single case to pair its conversation with the execution trace that produced it.