Can You Trust Your LLM-as-Judge? Closing the Loop from Annotation to Alignment
Zhou Yu|Latest insights and updates from the Arklex team

Every team building AI agents eventually adopts an LLM-as-judge. It's the only way to evaluate thousands of conversations without hiring an army of reviewers. Bu…
Zhou Yu|
We've spent a lot of time discussing how to evaluate AI agents, but voice agents present a distinct class of challenges that go well beyond what standard agent e…
Zhou Yu|
Most AI voice systems work in steps: you speak, the AI thinks, then it responds. That gap — even if it's just a second or two — gives safety systems time to chec…
Zhou Yu|
We recently tested Klarna's AI customer support chatbot using a web agent simulation framework that generates realistic, multi-turn shopping and support conversa…
Andy Yao|
We recently ran a series of tests on Amazon’s Rufus agent using a Web Agent simulation tool. The goal was simple: evaluate how well Rufus performs in a realistic…
Andy Yao|
TL;DR — If you own agent quality, here are three conversation shapes that will break the chat layer of every framework you're considering. We simulated 800 adver…
Arbit Chen|
A scenario you'll recognize: Wednesday, 9:40 PM. A PM drops a message in Slack: "This flow is still broken. I just reproduced the same issue by phrasing it sli…
Yi Ju|
Traditionally, you first collect data to train your ML model. In the LLM era, you no longer need data to build an LLM application. You simply prompt an LLM. It's…
Zhou Yu|
I spent 6 months testing my AI agent manually. ArkSim replaced that entire process in under 30 minutes, with a structured, traceable report I could hand directly…
Junshuo Liu|
An education company built an AI agent to handle student administrative questions, such as exam scheduling, test policies, and transcripts. Powered by GPT-5.1 with a…
Yi Ju|
Why multi-turn simulation exposes retry safety gaps that evaluation can’t see. Agents Should Be Tested Like Systems In the last post, I argued that agents like…
Arbit Chen|
When agents move from prompts to production, evaluation isn’t enough—testing becomes mandatory. From Model to System: Why OpenClaw-Style Agents Must Be Tested, …
Arbit Chen|
The difference between general and user-situated agents is fundamental: optimizing for specific users changes how they must be built and tested. Why This Distin…
Zhou Yu|
Why Simulation Is Becoming Essential for Testing AI Agents Salesforce recently introduced an approach for testing AI agents inside simulated enterprise environm…
Sarah Sun|
What happens when AI starts evaluating itself? We might be closer to that than you think. The Study Last month, Nielsen Norman Group published a really insigh…
Sarah Sun|
GED testing services are increasingly turning to AI chatbots to help students prepare for their exams. About GED Testing Service GED Testing Service (GEDTS), p…
Sarah Sun|
The future of e-commerce is being reshaped by AI shopping assistants that drive real revenue. Read the full white paper here: https://drive.google.com/file/d/1H…
Sarah Sun|
Imagine this: you’ve just rolled out a new AI chatbot. The launch meeting is full of optimism — executives are excited, employees are curious, and customers are …
Sarah Sun|
Enhancing Conversational AI for E-Commerce with Bottom-Up Synthesis Kun Qian¹, Maximillian Chen¹, Siyan Li¹, Arpit Sharma², Zhou Yu¹³ 1: Columbia University 2…
Zhou Yu|
Your AI Agent Works in Testing Then Crashes with Real Users. Why? It’s not your model. It’s how you’re testing it. Manual tests and happy-path demos miss real…
Yi Ju|