Evaluators & Rules

AI Trust Centre → Evaluators & Rules is where you tune what counts as a pass on every run. The page has two tabs:

  • Evaluators — continuous quality and safety scores, computed for every agent response.
  • Rules — hard invariants the bot must always (or never) satisfy.

A stats header summarises the current configuration:

Header field | Meaning
Evaluators enabled | N / 10: how many of the available evaluators are turned on.
Quality checks | Count of enabled Quality-category evaluators.
Safety checks | Count of enabled Safety-category evaluators.
Avg threshold | Mean of the threshold sliders across enabled evaluators.
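
To make the relationship between the header figures and the underlying settings concrete, here is a minimal Python sketch that derives them from an in-memory list of evaluator settings. The `EvaluatorConfig` type, its field names, and the threshold values are assumptions made for illustration; they are not the product's data model or its defaults.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class EvaluatorConfig:
    # Hypothetical shape; field names are illustrative only.
    name: str
    category: str   # "Quality" or "Safety"
    enabled: bool
    threshold: int  # slider value, 0-100


def header_stats(configs):
    """Derive the four header fields from a list of evaluator settings."""
    enabled = [c for c in configs if c.enabled]
    return {
        "evaluators_enabled": f"{len(enabled)} / {len(configs)}",
        "quality_checks": sum(c.category == "Quality" for c in enabled),
        "safety_checks": sum(c.category == "Safety" for c in enabled),
        # Mean over enabled evaluators only; None if nothing is enabled.
        "avg_threshold": mean(c.threshold for c in enabled) if enabled else None,
    }


# Three evaluators instead of ten, to keep the example short.
configs = [
    EvaluatorConfig("Empathy", "Quality", False, 50),
    EvaluatorConfig("Hallucination", "Quality", True, 70),
    EvaluatorConfig("PII Detector", "Safety", True, 90),
]
stats = header_stats(configs)
print(stats)
```

In this sketch a disabled evaluator drops out of the category counts and the average, which matches the "across enabled evaluators" wording in the table.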

A Reset to defaults button in the header rolls back any tuning.

[Screenshot: Evaluators & Rules page on the Evaluators tab. The header shows 10/10 evaluators enabled, 7 Quality checks, 3 Safety checks, and an average threshold. The Quality section lists Empathy, Accuracy / Quality Score, Response Variability, Strictness, Hallucination, Clear Communication, and Follow-up Handling, each with a description, threshold slider, and on/off toggle. The Safety section lists Language Filter (Toxicity & Bias), PII Detector, and Jailbreak Evaluator.]

The 10 evaluators

Each evaluator has an on/off toggle, a threshold slider (0–100, lower = more permissive), and an expand arrow that reveals a one-line description.

Quality (7 checks)

Evaluator | What it measures
Empathy | How well the bot acknowledges user concerns and emotional state.
Accuracy / Quality Score | Whether the bot resolves issues correctly and provides correct information.
Response Variability | Diversity of phrasing across the conversation; prevents repetitive answers.
Strictness | Adherence to configured instructions and guidelines.
Hallucination | Penalty for fabricating information (lower score = fewer hallucinations).
Clear Communication | Natural, non-repetitive, human-like phrasing.
Follow-up Handling | How the bot handles follow-up questions and maintains context across turns.

Safety (3 checks)

Evaluator | What it measures
Language Filter (Toxicity & Bias) | Detects toxic, biased, or unsafe language in the bot's responses.
PII Detector | Scans for leaked sensitive info (emails, SSNs, phone numbers, etc.) in the bot's responses.
Jailbreak Evaluator | Assesses the bot's resistance to prompt injections and bypass attempts.

How thresholds map to pass / fail

The slider value is the minimum score a response must reach to be considered a pass on that evaluator. Lower the threshold to be more permissive (more responses pass); raise it to be stricter.

The threshold is per-evaluator, not global. A bot can pass Accuracy at 80 while failing Hallucination at 35 — both contribute to the run's overall pass rate independently.
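
The per-evaluator comparison can be sketched in a few lines of Python. This is an illustration of the rule as stated above (a response passes when its score reaches the threshold), not the product's implementation; the function name and the score values are made up for the example.

```python
def evaluate_response(scores, thresholds):
    """Return a pass/fail verdict per evaluator for one response.

    A response passes an evaluator when its score reaches that
    evaluator's threshold (score >= threshold). Evaluators without
    a score for this response are skipped.
    """
    return {
        name: scores[name] >= threshold
        for name, threshold in thresholds.items()
        if name in scores
    }


# Illustrative values only: thresholds as in the example above,
# scores invented for the sketch.
thresholds = {"Accuracy": 80, "Hallucination": 35}
scores = {"Accuracy": 86, "Hallucination": 20}

verdicts = evaluate_response(scores, thresholds)
print(verdicts)  # {'Accuracy': True, 'Hallucination': False}
```

Each verdict is computed on its own, so tightening one slider never changes the outcome for another evaluator.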

💡 Try Copilot Nexus: "Recommend evaluator thresholds for a high-stakes financial-services bot — I need stricter Hallucination and Accuracy, more permissive on Empathy."

How evaluators roll up into category scores

The 10 evaluators roll up into seven category pills that summarise per-area performance for any given run:

Pill | Maps to evaluators (today)
Safety | Language Filter (Toxicity & Bias) + PII Detector + Jailbreak Evaluator.
Workflow | Workflow execution correctness; surfaces tool / workflow misuse.
RAG | Knowledge-base retrieval and grounding signal.
Routing | Whether the right agent / tool was picked given the user message.
Quality | Empathy + Accuracy + Clear Communication + Follow-up Handling.
Tool | Whether tools were called with valid arguments and their outputs were used.
Performance | Latency and cost signal across the run.

Today the breakdown is shaped by the available evaluators; as new evaluators are added to the catalog, they'll be slotted into the matching pill.
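
As a rough illustration of the roll-up, the sketch below groups evaluator scores under the two pills that the table defines in terms of evaluators (Safety and Quality). The page does not say how a pill's value is calculated, so the unweighted mean used here is an assumption, as are the function name and the sample scores. The other five pills draw on signals that are not modelled in this example.

```python
from statistics import mean

# Evaluator membership for the two pills defined in terms of evaluators.
PILL_EVALUATORS = {
    "Safety": [
        "Language Filter (Toxicity & Bias)",
        "PII Detector",
        "Jailbreak Evaluator",
    ],
    "Quality": [
        "Empathy",
        "Accuracy / Quality Score",
        "Clear Communication",
        "Follow-up Handling",
    ],
}


def pill_scores(evaluator_scores):
    """Aggregate evaluator scores per pill.

    Assumption: the aggregate is an unweighted mean over the member
    evaluators that produced a score (disabled ones are skipped).
    """
    result = {}
    for pill, members in PILL_EVALUATORS.items():
        present = [evaluator_scores[m] for m in members if m in evaluator_scores]
        result[pill] = mean(present) if present else None
    return result


# Invented scores for one run; two evaluators are disabled.
run_scores = {
    "Empathy": 70,
    "Accuracy / Quality Score": 90,
    "Clear Communication": 80,
    "PII Detector": 100,
    "Jailbreak Evaluator": 90,
}
pills = pill_scores(run_scores)
print(pills)
```

Keeping the membership in a single mapping means a newly added evaluator only needs one new entry to appear under its pill.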

Rules — hard invariants

The Rules tab is for things the bot must always (or never) do, regardless of the test case. Rules are evaluated alongside the evaluators on every run. If a rule fires (that is, the bot violates it), the run is flagged.

Examples:

  • "The bot must never quote pricing."
  • "If the user mentions a competitor, the bot must not respond with feature comparisons."
  • "Every response that mentions an order must include the order ID."

Rules use the same @-mention picker as Routing Logic for referencing agents, tools, or workflows.
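
Rules in the product are written as plain-language statements like the examples above. To show what "hard invariant" means in practice, the Python sketch below expresses two of those examples as simple pattern checks. Everything in it is hypothetical: the `ORD-` order-ID format, the currency pattern used as a stand-in for "quotes pricing", and the rule names are assumptions for the example and say nothing about how the product evaluates rules internally.

```python
import re

# Hypothetical order-ID format, assumed only for this example.
ORDER_ID = re.compile(r"\bORD-\d+\b")
MENTIONS_ORDER = re.compile(r"\border\b", re.IGNORECASE)
# Rough stand-in for "quotes pricing": a dollar sign followed by a digit.
PRICE = re.compile(r"\$\s?\d")


def violated_rules(response):
    """Return the names of the invariants this response breaks."""
    violations = []
    if PRICE.search(response):
        violations.append("never-quote-pricing")
    if MENTIONS_ORDER.search(response) and not ORDER_ID.search(response):
        violations.append("order-mention-needs-id")
    return violations


def flag_run(responses):
    """Map response index -> violated rules, for responses that break any rule."""
    return {i: v for i, r in enumerate(responses) if (v := violated_rules(r))}


responses = [
    "Your order ORD-1042 ships tomorrow.",
    "Your order is on its way.",
    "The premium plan costs $49 per month.",
]
flags = flag_run(responses)
print(flags)
```

Unlike an evaluator, a check of this kind has no threshold: a single violating response is enough to flag the run.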

💡 Try Copilot Nexus: "Write three Rules that prevent the bot from sharing personal data (PII) even when asked directly."

Step-by-step: configure scoring

  1. Open AI Trust Centre → Evaluators & Rules.
  2. On the Evaluators tab, disable any check that doesn't apply (e.g. switch off Empathy for a purely transactional bot).
  3. Tune thresholds. Lower = more permissive; higher = stricter. Hover over the slider to see the suggested baseline.
  4. Switch to the Rules tab. Add any invariants the bot must respect, regardless of test case.
  5. Save. Future Testing Lab runs use the new scoring.

Best practices

  • Tune once, then leave it. Stable scoring is more useful than precise scoring — if you re-tune thresholds before every release you can't compare runs.
  • Disable rather than relax. If Empathy is irrelevant for a tax-filing bot, turn it off. Leaving it on at threshold 5 produces noise.
  • Pair Evaluators with Rules. Evaluators tell you how well the bot did; Rules tell you what it must never do. Most bots need both — e.g. "be empathetic" (evaluator) and "never recommend a competitor" (rule).
  • Reset to defaults if you've lost the plot. If thresholds have been tuned by three different people over three months, defaults are a clean baseline.

💡 Try Copilot Nexus: "My latest run failed Hallucination at threshold 35 but passed Accuracy at 80. Are those signals contradictory or complementary? What should I look at?"

Read next: Test Case — open a failing case and pair its conversation with the trace that produced it.