What is simulation-based evaluation?

Instead of scoring a static dataset, Arklex creates the test data for you. It generates multi-turn conversations between synthetic users and your agent, then evaluates how the agent handled each turn. The result is coverage for failure modes you would not catch with single-turn benchmarks.

How is this different from other evaluation tools?

Most tools need you to bring your own test conversations. Arklex generates them. That means you can test for scenarios that have not happened in production yet, including edge cases where users push back, change their mind, or ask unexpected follow-ups.

Why does multi-turn testing matter?

An agent can ace a single question and still fall apart in a real conversation. Context gets lost by turn five. Tool calls break when the user changes direction. The agent contradicts something it said two turns ago. These are the failures that reach production, and they only show up when you test across multiple turns.

What agents and frameworks are supported?

Any agent, any framework. If it exposes an HTTP endpoint, speaks the A2A protocol, or is a Python class, Arklex can test it. The platform handles the simulation and evaluation regardless of how your agent is built.

Can I integrate this into my development workflow?

Arklex works as a CI/CD quality gate that runs on every code change, and as a standalone platform for testing, governance, and deployment approval. Teams typically start with ad-hoc testing during development and add CI gates once they have a baseline.
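As a rough illustration of what a quality gate does, the sketch below turns a batch of evaluation results into a pass/fail exit code. The result format (a list of dicts with a boolean "passed" key), the `gate` function, and the 0.9 default threshold are assumptions made for this example, not Arklex's actual output or API.

```python
def gate(results, min_pass_rate=0.9):
    """Return a process exit code: 0 if the pass rate clears the bar, else 1.

    `results` is a hypothetical list of per-conversation dicts, each with a
    boolean "passed" key.
    """
    if not results:
        # No evidence is treated as a failure rather than a silent pass.
        return 1
    pass_rate = sum(1 for r in results if r["passed"]) / len(results)
    print(f"pass rate {pass_rate:.2f} (minimum {min_pass_rate:.2f})")
    return 0 if pass_rate >= min_pass_rate else 1
```

In a CI job, a script like this would typically end with `sys.exit(gate(results))`, so that a non-zero exit code blocks the change.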

Workspaces are fully isolated with separate data storage. The platform can run on your infrastructure, keeping all conversations and evaluation data in your environment. Private cloud deployment is available for enterprise customers.


Can You Trust Your LLM-as-Judge? Closing the Loop from Annotation to Alignment

Zhou Yu, Co-founder & CEO

July 2, 2026


Every team building AI agents eventually adopts an LLM-as-judge. It's the only way to evaluate thousands of conversations without hiring an army of reviewers. But in nearly every conversation we have with agent builders, the same concern comes up:

"How do I know my judge is actually right?"

It's a fair question. An unverified judge is worse than no judge at all — it gives you confident numbers that may be measuring the wrong thing. A judge that passes a broken agent is a false green light. A judge that fails a good agent sends your engineers chasing ghosts.

At Arklex, we've built the answer directly into ArkSim: a workflow where your product managers and subject matter experts annotate judge decisions, and those annotations automatically update the judge — with guardrails that guarantee the judge gets better, not worse. Here's how we think about the problem, and the design principles behind our solution.

The trust gap: why out-of-the-box judges disagree with your experts

An off-the-shelf LLM judge evaluates against generic best practices. Your experts evaluate against domain-specific standards — business policies, tone requirements, hard-won lessons from production incidents. The judge isn't wrong in the abstract; it just doesn't know your definition of quality.

A customer support bot's response can be factually correct and still fail your standards because it opened without empathy. A SQL agent's query can be syntactically perfect and still be wrong because it queried raw tables instead of your certified views. No general-purpose judge knows these rules until someone teaches it.

The only ground truth for "is my judge right?" is agreement with your human experts. Which means the path to a trustworthy judge runs through human annotation — and most teams get stuck exactly there.
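One way to put a number on that agreement is sketched below. The post does not say which metric ArkSim reports, so both functions are illustrative: raw agreement is the share of matching labels, and Cohen's kappa corrects that share for the agreement two raters would reach by chance.

```python
from collections import Counter

def agreement_rate(judge, expert):
    """Fraction of items where the judge and expert labels match."""
    if len(judge) != len(expert) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(j == e for j, e in zip(judge, expert)) / len(judge)

def cohens_kappa(judge, expert):
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e)."""
    n = len(judge)
    p_o = agreement_rate(judge, expert)
    judge_counts, expert_counts = Counter(judge), Counter(expert)
    # Expected agreement if both raters labeled independently at their own rates.
    p_e = sum(judge_counts[label] * expert_counts[label] for label in judge_counts) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used the same single label throughout
    return (p_o - p_e) / (1 - p_e)

judge_labels = ["pass", "pass", "fail", "pass"]
expert_labels = ["pass", "fail", "fail", "pass"]
print(agreement_rate(judge_labels, expert_labels))  # 0.75
print(cohens_kappa(judge_labels, expert_labels))    # 0.5
```

Raw agreement is easier to explain to stakeholders, while kappa is harder to inflate when almost every label is "pass".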

Why annotation efforts stall

The traditional playbook — export transcripts to a spreadsheet, ask SMEs to label hundreds of rows, then prompt-engineer or fine-tune the judge — fails for predictable reasons:

Annotators lack context. A row in a spreadsheet showing one agent response can't tell you whether that response was appropriate. Was the user frustrated by turn six? Did the agent already promise something in turn two? What did the retrieval step actually return? Without the full picture, annotations are guesses, and inter-annotator agreement collapses.

The tooling wasn't built for PMs and SMEs. The people who know what "good" looks like are rarely the people comfortable with JSON dumps and eval scripts. Every ounce of friction shrinks your labeled dataset.

Labels are information-sparse. A thumbs-down tells the judge that it was wrong, not why. It can take dozens of contrastive examples to implicitly teach a rule that one sentence of expert feedback makes explicit: "Escalations mentioning refunds must always be routed to billing, even if the user asks a product question."

Nothing closes the loop. Even teams that collect annotations often let them sit in a spreadsheet. The judge never learns, so the disagreements never shrink, so annotating starts to feel pointless — and stops.

Design principle 1: give annotators enough context to make annotation easy

The single highest-leverage decision in annotation tooling is what the annotator sees. In ArkSim, when a PM or SME reviews a judge decision, they get:

The full conversation, not an isolated turn. Multi-turn agents fail in multi-turn ways — losing context, contradicting earlier responses, calling the wrong tool after a topic shift. The annotator sees the whole trajectory, with the turn under review highlighted in place.

The judge's own verdict and rationale. Reviewing a judgment is far faster than producing one from scratch. The annotator's job becomes "agree or correct," not "evaluate from zero." When they disagree, the judge's stated reasoning shows them exactly which assumption to correct.

The evaluation criterion, in plain language. Annotators shouldn't have to remember what "faithfulness" means in your rubric. The definition is right there, so different annotators apply the same standard.

A place to say why. Alongside the corrected label, annotators write a sentence of natural-language feedback. This is deliberately the centerpiece of the interface, because that sentence is worth more than the label itself — it's the rule, stated explicitly, ready to be taught.

The result: annotation sessions that used to require an engineer to prepare data exports become something a PM does in the product, in minutes, between meetings.

Design principle 2: annotations must actually update the judge

Collecting feedback that goes nowhere kills the program. In ArkSim, annotation and alignment are the same workflow: you review a batch of judge decisions, correct the ones that are wrong, explain why — and update the judge with one click.

Under the hood, we build on MemAlign, the judge-alignment framework from Databricks Research. MemAlign aligns a judge through a dual-memory system rather than prompt surgery or fine-tuning:

Semantic memory stores generalizable principles distilled from feedback — "evaluate safety based on intent, not just language."
Episodic memory stores the specific hard cases where the judge stumbled, retrieved as anchors when similar inputs appear.
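To make the two memory types concrete, here is a toy sketch. It is not MemAlign's API: the `JudgeMemory` class, its method names, and the word-overlap retrieval are all stand-ins chosen for illustration, and a real system would presumably retrieve by embedding similarity instead.

```python
from dataclasses import dataclass, field

@dataclass
class Episode:
    situation: str   # the input the judge got wrong
    correction: str  # the expert's corrected label plus explanation

@dataclass
class JudgeMemory:
    principles: list = field(default_factory=list)  # semantic memory
    episodes: list = field(default_factory=list)    # episodic memory

    def add_principle(self, text):
        # Principles are general rules, so exact duplicates add nothing.
        if text not in self.principles:
            self.principles.append(text)

    def add_episode(self, situation, correction):
        self.episodes.append(Episode(situation, correction))

    def retrieve(self, query, k=2):
        """Return up to k episodes ranked by word overlap with the query."""
        q = set(query.lower().split())
        scored = [(len(q & set(e.situation.lower().split())), e) for e in self.episodes]
        scored = [(score, e) for score, e in scored if score > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [e for _, e in scored[:k]]

    def build_prompt(self, criterion, query):
        """Assemble judge context: criterion, all principles, nearest episodes."""
        lines = [f"Criterion: {criterion}", "Principles:"]
        lines += [f"- {p}" for p in self.principles]
        lines.append("Similar past corrections:")
        lines += [f"- {e.situation} -> {e.correction}" for e in self.retrieve(query)]
        return "\n".join(lines)
```

The split mirrors the description above: principles always go into the judge's context, while episodes are pulled in only when the current input resembles a past mistake.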

This architecture is why the loop is fast enough to be interactive. Alignment runs in seconds, not the minutes-to-hours of prompt optimizers, and shows measurable improvement from just a handful of examples. Your SME can correct a judgment, explain the reasoning, and watch the judge improve in the same session. And because memory entries are inspectable, you can see exactly what the judge has learned — and delete or overwrite guidance when your standards evolve.

Design principle 3: prove the judge got better, not worse

An alignment step that silently degrades your judge is scarier than no alignment at all. Every judge update in ArkSim is treated like a code deploy: it must pass checks before it ships.

We hold out a slice of your annotations as a verification set the judge never learns from, and measure agreement before and after each update. We also re-run the judge on previously corrected cases to catch regressions — the classic failure where fixing one error re-breaks another. If agreement drops on either front, the update is flagged instead of applied, and you can inspect exactly which memory entry caused the conflict. Alignment only moves forward when the numbers say it should.
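The two checks described above can be sketched as a single gate function. This is a minimal illustration under stated assumptions, not ArkSim's implementation: a judge is modeled as a plain callable from an item to a label, each case is an `(item, expert_label)` pair, and `check_update` is a hypothetical name.

```python
def agreement(judge_fn, cases):
    """Share of (item, expert_label) cases where the judge matches the expert."""
    if not cases:
        raise ValueError("need at least one case")
    return sum(judge_fn(item) == label for item, label in cases) / len(cases)

def check_update(old_judge, new_judge, held_out, previously_corrected):
    """Compare the old and new judge on both sets and decide apply vs. flag."""
    report = {
        # Held-out annotations the judge never learned from.
        "held_out_before": agreement(old_judge, held_out),
        "held_out_after": agreement(new_judge, held_out),
        # Cases corrected in earlier rounds, re-run to catch regressions.
        "regression_before": agreement(old_judge, previously_corrected),
        "regression_after": agreement(new_judge, previously_corrected),
    }
    # Apply only if agreement did not drop on either front.
    report["apply"] = (
        report["held_out_after"] >= report["held_out_before"]
        and report["regression_after"] >= report["regression_before"]
    )
    return report
```

Returning the full report rather than a bare boolean keeps the before-and-after numbers available for whoever inspects a flagged update.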

Over time, this gives you the artifact that answers the original trust question: a tracked, monotonically improving agreement rate between your judge and your experts, on data the judge has never seen.

Adapting MemAlign to conversations: annotations aren't independent

MemAlign was designed for largely independent examples — one input, one judgment, one piece of feedback. Agent evaluation isn't like that. ArkSim evaluates conversations, and annotations on turns within the same conversation are correlated: they share the same synthetic user, the same context, and often the same root cause. If the agent loses the user's account number at turn three, turns four through eight may all be judged failures — but that's one lesson, not five.

Naively feeding turn-level annotations into alignment causes two problems, and we adapted MemAlign to address both:

Leakage in verification. If turns from one conversation land in both the training and held-out sets, the judge is being verified on data it has effectively seen, and your agreement metrics flatter you. ArkSim splits at the conversation level, never the turn level, so verification numbers reflect true generalization.

Redundancy in memory. Five correlated corrections from one conversation aren't five independent lessons. Treating them as such over-weights a single incident in semantic memory and stuffs episodic memory with near-duplicates that crowd out diverse examples at retrieval time. We deduplicate and down-weight within-conversation annotations during memory distillation, so one bad conversation contributes one principle and its single most instructive anchor — not five echoes of the same mistake.
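Both adaptations come down to treating the conversation, not the turn, as the unit of work. The sketch below illustrates the idea under stated assumptions: annotations are dicts with hypothetical `conversation_id` and `turn` keys, and "most instructive anchor" is approximated by the earliest corrected turn, which is a heuristic chosen for this example rather than the rule ArkSim uses.

```python
import random
from collections import defaultdict

def group_by_conversation(annotations):
    """Bucket turn-level annotations by the conversation they came from."""
    groups = defaultdict(list)
    for ann in annotations:
        groups[ann["conversation_id"]].append(ann)
    return groups

def split_by_conversation(annotations, held_out_fraction=0.2, seed=0):
    """Split so that no conversation contributes turns to both sets."""
    groups = group_by_conversation(annotations)
    conv_ids = sorted(groups)
    random.Random(seed).shuffle(conv_ids)  # seeded for a reproducible split
    n_held_out = round(len(conv_ids) * held_out_fraction)
    held_out = [a for cid in conv_ids[:n_held_out] for a in groups[cid]]
    train = [a for cid in conv_ids[n_held_out:] for a in groups[cid]]
    return train, held_out

def collapse_corrections(corrections):
    """Keep one anchor per conversation instead of one per corrected turn."""
    anchors = []
    for group in group_by_conversation(corrections).values():
        # Assumed heuristic: the earliest corrected turn is closest to the root cause.
        anchor = min(group, key=lambda c: c["turn"])
        merged = sorted(c["turn"] for c in group)
        anchors.append({**anchor, "merged_turns": merged})
    return anchors
```

With the earlier example, corrections on turns four through eight of one conversation collapse to a single anchor at turn four, with the other four turns recorded as merged rather than stored as separate episodes.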

We also preserve conversation context in episodic memory. A turn-level correction is stored with the conversational state that made it wrong, so retrieval matches on the situation — "user has already been promised a refund" — rather than surface similarity of a single message.

The loop, end to end

Put together, judge development in ArkSim looks like this: simulate conversations with synthetic users, let the judge score every turn, have your PMs and SMEs review the judgments with full context and explain their corrections, align the judge in seconds, and verify on held-out conversations that agreement went up before the update ships. Then repeat — each cycle shrinking the gap between what your judge measures and what your experts mean by quality.

That's what it takes to move an LLM-as-judge from "a number we hope is right" to "a metric we've verified against our own experts." Your judge is only as good as your ability to check it — so we made checking it, and fixing it, part of the product.

Want to see the annotation and alignment loop on your own agent? Try ArkSim