Test your v3 Agent
Nexus gives you two complementary places to test:
- Playground — interactive, single-conversation testing on every agent's profile. Best while you build.
- AI Trust Centre — durable, dataset-driven evaluation. Best before each release. The Trust Centre has two sub-pages: Testing lab for cases and runs, and Evaluators & Rules for the scoring criteria.
Use both — they answer different questions.
Playground — try the bot as a user would
Inside any agent's profile, click the play (▶) icon in the title bar to open the Playground — a side panel on the left where you chat with the bot exactly as a customer would. This is your fastest feedback loop while building.
The Playground gives you:
- The bot's actual welcome message and quick-reply chips.
- A live chat input — type a message, hit Enter, watch the agent respond in real time.
- Per-message action icons (copy, voice playback, 👍 / 👎) for quick feedback.
- A voice/mic toggle for testing voice flows without leaving the page.
Don't confuse the Playground with Copilot. The right-hand "How can I help you today?" panel on an agent's profile is Copilot — an AI assistant for you, the builder. Use Copilot to ask questions about the bot's configuration, generate suggestions, or scaffold logic. The Playground is for acting as the end-user and seeing how your bot actually replies.

Step-by-step: try the bot
- Open any agent from AI Agent → Agents.
- Click the play (▶) icon in the title bar — the Playground opens on the left.
- Use one of the welcome quick-reply chips, or type a message and press Enter.
- Watch the agent respond. If voice is enabled for the bot, tap the speaker icon on a bot message to hear the TTS playback.
- Use 👍 / 👎 on individual messages to flag responses that looked good or bad — useful when you come back later to figure out what to fix.
Step-by-step: verify a routing rule fires
- Open the agent whose routing you want to test.
- Send a message that should match the rule (e.g. "I need a refund").
- Confirm the expected agent or tool takes over — the persona / response pattern should match.
- If the wrong thing fires, go back to Routing Logic and tighten the rule, or sharpen the agent's Trigger.
Step-by-step: verify a widget renders
- Make sure your bot has v3 agents enabled — widgets only render in v3 conversations.
- In the Playground, send a message that triggers a workflow node hosting your widget.
- The widget renders inline in the Playground chat. Interact with it (fill the form, click the button); the output flows back into the conversation as the next user input.
AI Trust Centre — durable evaluation
Open the AI Trust Centre group from the studio's left nav. It has two sub-pages:
| Sub-page | What it's for |
|---|---|
| Testing lab | Build test cases, run them as datasets, browse run history, replay failures. |
| Evaluators & Rules | Configure the scoring criteria every test run uses — Evaluators (continuous quality and safety scores) and Rules (hard invariants). |
Testing lab
The empty state shows "No test cases yet — Create your first test case to start evaluating your AI agent." From here, the top toolbar gives you Search test cases, Run history, and New test case. Run history lists every dataset run with pass / fail counts; clicking a run drills into per-case results.

Step-by-step: build a regression dataset
- Open AI Trust Centre → Testing lab.
- Click New test case. The drawer has two tabs:
  - Trial run: drive a live conversation with the bot, then approve what you saw as a test case. Best when you already know how the bot should behave for a given prompt and want to lock it in.
  - Scenario: describe what the test case should check (agent goal, rules, inputs, optional title hint) and click Generate with AI. The LLM drafts a short user ↔ agent conversation plus 1–3 expected behaviours; edit any field before saving.
- For each scenario you want to cover, the saved case ends up with:
  - A clear name ("billing question routes correctly", "fallback fires on gibberish").
  - The user inputs: the messages a real user would send.
  - Any initial state the test needs (e.g. logged-in customer ID).
  - The expected outcome or behaviour.
- Aim for 10–30 cases first time. Cover the golden path for each major journey, one case per Routing Logic rule, one case per tool, and 2–3 adversarial prompts.
- Run the dataset from the Testing lab toolbar. Failures collect under Run history.
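As a planning aid, the coverage guidance above can be sketched in a few lines of Python. This is only an illustration, not a Nexus API: the `RegressionCase` class, its `tags` field, and the tag names are all assumptions made up for this example. The idea is to list what you intend to cover (one tag per routing rule, tool, and adversarial prompt) and check which items still lack a case.

```python
from dataclasses import dataclass, field


@dataclass
class RegressionCase:
    """Hypothetical local record of a test case; not the Nexus storage format."""
    name: str
    user_inputs: list[str]
    expected: str
    initial_state: dict = field(default_factory=dict)
    tags: set[str] = field(default_factory=set)  # assumed coverage labels


def coverage_gaps(cases, required_tags):
    """Return the required tags that no case covers yet, sorted for stable output."""
    covered = set().union(*(case.tags for case in cases))
    return sorted(set(required_tags) - covered)


cases = [
    RegressionCase(
        name="billing question routes correctly",
        user_inputs=["Why was I charged twice?"],
        expected="Billing agent takes over",
        tags={"routing:billing"},
    ),
    RegressionCase(
        name="fallback fires on gibberish",
        user_inputs=["asdf qwer zxcv"],
        expected="Fallback response is returned",
        tags={"adversarial"},
    ),
]

# Hypothetical checklist: one tag per routing rule, per tool, per adversarial prompt.
required = ["routing:billing", "routing:refund", "tool:getOrderStatus", "adversarial"]
missing = coverage_gaps(cases, required)
print(missing)  # ['routing:refund', 'tool:getOrderStatus']
```

Keeping a checklist like this next to the dataset makes it easier to see when a new routing rule or tool has shipped without a matching case.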
Assertion picker (trial-run flow)
When you use the Trial run flow to capture a test case, the backend runs the conversation, captures the trace, and proposes a set of suggestedAssertions — checks that would have passed against the trace you just produced (e.g. "getOrderStatus tool was called on turn 2", "final response contained the tracking number").
Before saving, the Assertion picker lets you tick which suggestions to ship onto the saved test case:
- Turn-anchored suggestions render inline beside the matching agent message, with the same card chrome as the trace events panel.
- Non-turn-anchored suggestions live in a panel below the conversation.
- Internal $$memory paths and trivial max_turns suggestions are filtered out by default (you can still pick them if you want to lock them in).
Only the assertions you check are written to the test case — everything else is treated as informational signal, not a hard pass/fail criterion.
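The picker's behaviour can be pictured with a small Python sketch. The real shape of suggestedAssertions is not documented here, so the `Suggestion` class, its `kind` and `turn` fields, and the kind names are assumptions for illustration only. The sketch also treats every max_turns suggestion as trivial, which is a simplification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Suggestion:
    """Hypothetical stand-in for one suggested assertion."""
    kind: str                   # assumed labels, e.g. "tool_called", "memory_path"
    description: str
    turn: Optional[int] = None  # None means the suggestion is not tied to a turn


# Assumed kinds that are hidden by default (internal memory paths, trivial max_turns).
HIDDEN_BY_DEFAULT = {"memory_path", "max_turns"}


def default_visible(suggestions):
    """Drop the suggestions that are filtered out by default."""
    return [s for s in suggestions if s.kind not in HIDDEN_BY_DEFAULT]


def split_by_anchor(suggestions):
    """Separate turn-anchored suggestions (inline) from the rest (panel)."""
    inline = [s for s in suggestions if s.turn is not None]
    panel = [s for s in suggestions if s.turn is None]
    return inline, panel


def assertions_to_save(suggestions, ticked):
    """Keep only the suggestions whose index was ticked."""
    return [s for i, s in enumerate(suggestions) if i in ticked]


suggestions = [
    Suggestion("tool_called", "getOrderStatus tool was called on turn 2", turn=2),
    Suggestion("response_contains", "final response contained the tracking number"),
    Suggestion("memory_path", "an internal $$memory path was written", turn=1),
    Suggestion("max_turns", "conversation finished within max_turns"),
]

visible = default_visible(suggestions)
inline, panel = split_by_anchor(visible)
saved = assertions_to_save(visible, ticked={0})
print([s.description for s in saved])
```

Only the ticked suggestion ends up in `saved`; the unticked one stays informational, which mirrors the rule that unchecked suggestions are never a pass/fail criterion.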
Inline traces on the test case detail panel
Open any saved test case and switch to the Conversation tab. Each agent message is paired with the execution trace that produced it — tool calls, memory updates, per-turn metrics — captured at the time the test case was created.
If a test case predates the baseline-capture flow (v2 agents, legacy test cases), you'll see the plain bubble list without trace pairing.
Manage tests from the agent detail page
You don't have to leave the agent surface to work with tests for it. Each agent's detail page has a Tests button in the topbar (next to Version History) that opens a popover with:
- The number of test cases linked to this agent.
- The latest run's pass / fail summary.
- A live progress bar for any in-flight run (auto-polls every 5 seconds).
- An AI-generation indicator when scenarios are being drafted.
- Quick links to the Testing lab (the deep link pre-filters the table by this agent and auto-selects its tests, ready for a one-shot "Run Tests" click) and to the Run detail page.
When you start a run from the popover, the change broadcasts across open tabs, so the Testing lab table refreshes everywhere without a reload. The popover also exposes a Create test case shortcut that opens the standard drawer pre-populated with this agent as the source.
This is purely a navigation and surfacing convenience; the test cases themselves still live under AI Trust Centre → Testing lab.
Evaluators & Rules
This sub-page is where you tune what counts as a pass. The page has two tabs — Evaluators and Rules — plus a stats header showing how many evaluators are enabled, how many are Quality vs Safety checks, and the average threshold across them.
Evaluators score every response automatically. They're grouped into two categories:
| Category | What's measured |
|---|---|
| Quality (7 checks today) | Empathy, Accuracy / Quality Score, Response Variability, Strictness, Hallucination, Clear Communication, Follow-up Handling. |
| Safety (3 checks today) | Language Filter (toxicity & bias) plus other safety guardrails. |
Each evaluator has an on/off toggle, a threshold slider (0–100, lower = more permissive), and an expand arrow with a one-line description of what the check measures. Reset to defaults in the header rolls back any tuning.

Rules (the second tab) let you define hard invariants the bot must always (or never) satisfy. Example: "The bot must never quote pricing." Rules are evaluated alongside the evaluators on every run, and a run is flagged whenever a response violates a rule.
Step-by-step: configure scoring
- Open AI Trust Centre → Evaluators & Rules.
- On the Evaluators tab, disable any check that doesn't apply (e.g. switch off Empathy for a transactional bot).
- Tune thresholds — lower = more permissive, higher = stricter. The slider value is the minimum score a response must reach to be considered a pass on that check.
- Switch to the Rules tab. Add any invariants the bot must respect, regardless of the test case.
- Save. Future Testing lab runs use the new scoring.
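The scoring semantics described above can be illustrated with a short Python sketch. It is not how Nexus implements scoring: `EvaluatorConfig`, `evaluator_verdicts`, and `flagged_rules` are hypothetical names, and the pricing rule uses a naive "$" check purely as a stand-in for however the platform actually evaluates rules.

```python
from dataclasses import dataclass


@dataclass
class EvaluatorConfig:
    """Hypothetical mirror of one evaluator row: toggle plus threshold slider."""
    name: str
    enabled: bool
    threshold: int  # 0-100; the minimum score a response must reach to pass


def evaluator_verdicts(scores, configs):
    """Map each enabled evaluator to True (pass) or False (fail)."""
    return {c.name: scores[c.name] >= c.threshold for c in configs if c.enabled}


def flagged_rules(response, rules):
    """Return the names of rules the response violates."""
    return [name for name, is_violated in rules.items() if is_violated(response)]


configs = [
    EvaluatorConfig("Empathy", enabled=False, threshold=70),  # off for a transactional bot
    EvaluatorConfig("Accuracy", enabled=True, threshold=80),
    EvaluatorConfig("Hallucination", enabled=True, threshold=60),
]
scores = {"Empathy": 10, "Accuracy": 85, "Hallucination": 55}  # made-up scores

verdicts = evaluator_verdicts(scores, configs)
print(verdicts)  # {'Accuracy': True, 'Hallucination': False}

# Stand-in rule check: a real rule would not be a simple substring test.
rules = {"never quote pricing": lambda text: "$" in text}
flagged = flagged_rules("Your plan costs $20 per month.", rules)
print(flagged)  # ['never quote pricing']
```

Note how the disabled Empathy check never appears in the verdicts even though its score is low, and how lowering the Hallucination threshold to 55 would turn that failure into a pass.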
Common testing pitfalls
- Testing only the golden path. The hard cases are where bugs hide. Schedule time for adversarial testing — jailbreaks, off-topic, hostile users.
- Forgetting to test after rules change. Even a small wording tweak in identity, conversation rules, or routing logic can shift behaviour.
- Trusting "it worked once." LLMs are stochastic. Run the same test several times; a fragile behaviour will fail intermittently, and a single passing run can hide it.
- Not saving test cases. A test you ran manually once is one you'll have to re-run manually next time. Save it to the Testing lab dataset.
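To make the "it worked once" pitfall concrete, here is a minimal Python sketch of measuring a pass rate over repeated runs. The `run_once` callable is an assumption: it stands in for whatever triggers one run of a test case and reports pass or fail, and the fixed outcome sequence below simulates that so the example is reproducible.

```python
def pass_rate(run_once, repeats=10):
    """Run the same check several times and return the fraction that passed."""
    passes = sum(1 for _ in range(repeats) if run_once())
    return passes / repeats


def is_flaky(rate):
    """A case is flaky when it neither always passes nor always fails."""
    return 0.0 < rate < 1.0


# Simulated outcomes of five runs of the same test case (one intermittent failure).
outcomes = iter([True, True, False, True, True])
rate = pass_rate(lambda: next(outcomes), repeats=5)

print(f"pass rate: {rate:.0%}")  # pass rate: 80%
print("flaky" if is_flaky(rate) else "stable")  # flaky
```

A case that passes four runs out of five would look fine in any single manual check, which is why repeated runs are worth the extra time for behaviours you depend on.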
Best practices
- Test in the Playground first, then promote durable cases to a Testing lab dataset.
- Tune Evaluators once, save the configuration, and let them score every future run automatically. Don't eyeball every run; that's what evaluators are for.
- Treat the regression dataset as production code. Review it, evolve it, don't let it rot.
- Test voice and chat separately — they don't behave identically, even with the same agent config.
Go to Widget Builder if you need custom UI in your conversations.