Benchmarks Are for Labs. Evals Are for You.

You picked a model. Good.

The leaderboard told you Gemini 3 Pro wins reasoning benchmarks. Claude Opus leads professional tasks. GPT-5 is strong across categories. You've done your homework. You made a call.

Now here's the question the leaderboard can't answer: is it working?

Not working in general. Working on YOUR task, with YOUR data, for YOUR users. That's a different question entirely.

Benchmarks score models. You need to score yours.

Their Grade, Not Yours

The labs are benchmarking against Humanity's Last Exam. SWE Atlas. MCP Atlas. These are designed to stress-test general frontier capability across thousands of tasks from hundreds of domains.

Genuine signal. Serious engineering. Not your problem.

Those scores have a contamination problem: the models fighting for top spots are also optimizing for the tests. Scale's own analysis found 13%+ accuracy drops when models were evaluated on new variants they hadn't seen. The winner this quarter may have learned the answers - not the reasoning.

A high leaderboard score tells you one thing: this model can reason well in general. It says nothing about YOUR specific task. Those are two different questions.

This is their race. Benchmarks are how labs measure their progress against each other. Not how you measure your product against your users.

Build an eval.

The Number That Changes Everything

"Writing evals is going to become a core skill for product managers." - Kevin Weil, CPO of OpenAI

Not because evals are trendy. Because the number they produce changes everything about what you build.

You need to know your accuracy number. Not the model's general accuracy across hundreds of tasks. Your task's accuracy. How often does this model get it right on the one thing your product depends on?

That number determines your entire product:

Accuracy	What Your Product Looks Like
60%	Heavy human review. AI as suggestion engine, not actor.
95%	Light review. Flag the exceptions. Let the rest through.
99.5%	Automated with audit trail. Humans in the loop for edge cases only.

That's not a small difference. That's your UX, your trust model, your staffing plan, your launch timeline.
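One way to make the table concrete is to treat it as a lookup from your measured accuracy to a review policy. This is a minimal sketch: the function name `review_policy` is made up, and the cutoffs are assumptions read off the three rows (the table gives points, not ranges). Set your own based on what a wrong answer costs you.

```python
def review_policy(accuracy: float) -> str:
    """Map a measured task accuracy (0.0 to 1.0) to a review policy.

    The cutoffs below are illustrative assumptions taken from the three
    rows of the table, not fixed rules.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0.0 and 1.0")
    if accuracy >= 0.995:
        return "automated with audit trail"
    if accuracy >= 0.95:
        return "light review, flag exceptions"
    return "heavy human review"


for measured in (0.60, 0.97, 0.995):
    print(f"{measured:.1%} -> {review_policy(measured)}")
```

The point of writing it down is that the policy becomes a function of a number you measured, not a guess.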

The leaderboard gives you a score across hundreds of tasks. Against your one task, that's noise. Your number is specific. It's measurable. It's yours to get - but only if you measure it.

So you decide to build an eval. And this is where most people go wrong.

Why Evals Go Off the Rails

"A lot of people go straight into evals like, 'Let me just write some tests,' and that is where things go off the rails." - Hamel Husain

There are three places this breaks down.

Testing the wrong thing. You build evals for the behaviors you assume matter, before looking at what actually happens in production. The result: a passing eval suite for the wrong task. Before you write a single test, look at your traces - real outputs, real users, what your model is actually producing.

The 1-5 scale. You ask your eval to score outputs on a 1-5 scale. That's a weasel way of not making a decision. When you report 3.2 versus 3.7, no one knows what that means. Make it binary. Pass or fail. Good or not good. The binary decision forces judgment. That judgment builds your ground truth.

Automating before calibrating. You deploy an LLM as judge before validating it against human opinion. "Before you release your LLM as a judge, you want to make sure it's aligned to the human." An uncalibrated judge introduces error at the measurement layer. Then you're reporting a precise-looking number that measures the wrong thing - which is worse than not measuring at all.
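A calibration check for that third failure can be sketched in a few lines: take outputs a human has already labeled pass or fail, run the judge on the same outputs, and compare. The function name `judge_agreement` and the sample labels are hypothetical, and what counts as "aligned enough" is your call.

```python
from typing import Dict, Sequence


def judge_agreement(human: Sequence[bool], judge: Sequence[bool]) -> Dict[str, float]:
    """Compare an LLM judge's pass/fail labels against human labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    pairs = list(zip(human, judge))
    # Judge verdicts split by what the human decided.
    on_human_pass = [j for h, j in pairs if h]
    on_human_fail = [j for h, j in pairs if not h]
    return {
        # Share of outputs where judge and human made the same call.
        "agreement": sum(h == j for h, j in pairs) / len(pairs),
        # Of the outputs the human passed, how many the judge also passed.
        "pass_recall": (sum(on_human_pass) / len(on_human_pass)
                        if on_human_pass else float("nan")),
        # Of the outputs the human failed, how many the judge also failed.
        "fail_recall": (sum(not j for j in on_human_fail) / len(on_human_fail)
                        if on_human_fail else float("nan")),
    }


# Made-up labels for eight outputs (True = pass).
human_labels = [True, True, True, False, False, True, False, True]
judge_labels = [True, True, False, False, True, True, False, True]

report = judge_agreement(human_labels, judge_labels)
print(report)
```

Looking at passes and fails separately matters: a judge that passes everything can still show high overall agreement when most outputs are good.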

The highest ROI move isn't writing tests. It's looking at what your model is already doing.

What Actually Works

"It's the highest ROI activity you can engage in. Let's go look at your traces. People are surprised... It always 100% of the time teaches you what the problem is." - Hamel Husain

Here's the practical framework. Four moves.

Start with traces, not tests. Open your production logs. Look at actual model outputs. What is it getting right? What is it getting wrong? Don't write a test until you know what failure modes actually exist. The failure modes are already there. You just haven't looked.

Make it binary. For each output you review: pass or fail. Good or not good. Build a labeled dataset from real outputs. That IS your ground truth. No scoring gymnastics, no 3.2 vs 3.7 debates - just decisions.
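The first two moves can be sketched as a tiny labeled dataset. Everything here is illustrative: `LabeledTrace`, its fields, and the sample traces are assumptions, and in practice the rows come from your production logs rather than from a list typed by hand.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LabeledTrace:
    trace_id: str
    output: str
    passed: bool            # the binary call: good or not good
    failure_mode: str = ""  # short reviewer note, filled in for failures


def pass_rate(traces: List[LabeledTrace]) -> float:
    """Share of reviewed outputs that passed. This is YOUR number."""
    if not traces:
        raise ValueError("label at least one trace first")
    return sum(t.passed for t in traces) / len(traces)


def failure_modes(traces: List[LabeledTrace]) -> Counter:
    """Count reviewer notes across the failed outputs."""
    return Counter(t.failure_mode for t in traces if not t.passed)


# Made-up examples standing in for real reviewed outputs.
labeled = [
    LabeledTrace("t1", "Refund issued for order 1042.", True),
    LabeledTrace("t2", "Refund issued for order 9999.", False, "wrong order id"),
    LabeledTrace("t3", "I cannot help with that.", False, "unnecessary refusal"),
    LabeledTrace("t4", "Refund issued for order 2210.", True),
    LabeledTrace("t5", "Refund issued for order 0000.", False, "wrong order id"),
]

print(f"pass rate: {pass_rate(labeled):.0%}")
print(failure_modes(labeled).most_common())
```

The failure-mode tally is what tells you which test is worth writing first.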

Right-size it.

"If we cannot collect the sample size in a month, we shouldn't test it. We should just go pre versus post." - Chip Huyen

Small and accurate beats large and unmanageable. You don't need statistical significance before you launch. You need YOUR number. Start there.
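If you want a rough sense of how far a small sample can be trusted, an interval around your pass rate helps. The sketch below uses the Wilson score interval, which is one common choice and not something the quotes above prescribe; the function name and the 45-of-50 example are assumptions.

```python
import math
from typing import Tuple


def wilson_interval(passes: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Approximate 95% Wilson score interval for a pass rate (z = 1.96)."""
    if total <= 0 or not 0 <= passes <= total:
        raise ValueError("need 0 <= passes <= total and total > 0")
    p = passes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Same 90% pass rate, two hypothetical sample sizes.
small = wilson_interval(45, 50)
large = wilson_interval(450, 500)
print(f"50 labels:  {small[0]:.3f} to {small[1]:.3f}")
print(f"500 labels: {large[0]:.3f} to {large[1]:.3f}")
```

A wide interval is not a reason to wait. It tells you how much weight the number can carry and whether another batch of labels is worth the time.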

Don't over-invest when the delta is small.

"Maybe improve it from 80% to 82%. But if we spend those two engineers on a new feature, we could get so much more improvement." - Chip Huyen

Evals are tools, not ends. When the ROI of more eval engineering beats the ROI of building more product - invest. When it doesn't, ship.

The result of these four moves: a small, accurate, binary eval suite that tells you YOUR number. And once you have it - it's yours.

Your Eval Suite Is Part of Your Base

Post #1 established: YOU^AI. The model is the exponent. You are the base. Same model, different bases - wildly different results.

Your eval suite is part of your base. It's where your judgment about quality lives - made explicit and portable. You know what good output looks like for your task. That knowledge came from looking at your data, making binary calls, learning where the model succeeds and fails. It took time to develop. It's yours.

With other tools, the session ends and your eval knowledge disappears. In your workbench, evals compound and your accuracy knowledge persists.

When you switch models - and you will switch - you point your eval suite at the new one. Your measurement layer persists. Your ground truth persists. You know within hours whether the new model improves your number. Only the model changes.

This is the practical meaning of model-agnostic: not just that your workbench works with any model, but that your judgment about what good looks like is portable. You don't rebuild your eval suite every time the leaderboard reshuffles. You run it. That's the compounding advantage.
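Here is a minimal sketch of what "you run it" can look like, assuming a model is anything you can call with a prompt to get text back. `EvalCase`, `run_suite`, and the two stand-in models are hypothetical; a real suite would wrap your provider's client in the same callable shape.

```python
from typing import Callable, List, NamedTuple

Model = Callable[[str], str]  # assumption: prompt in, text out


class EvalCase(NamedTuple):
    prompt: str
    passes: Callable[[str], bool]  # binary check on the model's output


def run_suite(model: Model, cases: List[EvalCase]) -> float:
    """Run every case against one model and return the pass rate."""
    if not cases:
        raise ValueError("the suite has no cases")
    results = [case.passes(model(case.prompt)) for case in cases]
    return sum(results) / len(results)


# Stand-in models so the sketch runs without any API.
def current_model(prompt: str) -> str:
    return prompt.upper()


def candidate_model(prompt: str) -> str:
    return prompt.strip().upper()


# The suite stays fixed. Only the model passed to run_suite changes.
suite = [
    EvalCase("refund request", lambda out: out == "REFUND REQUEST"),
    EvalCase("billing question", lambda out: out == "BILLING QUESTION"),
    EvalCase("  late delivery ", lambda out: out == "LATE DELIVERY"),
    EvalCase(" cancel order", lambda out: out == "CANCEL ORDER"),
]

scores = {
    "current": run_suite(current_model, suite),
    "candidate": run_suite(candidate_model, suite),
}
print(scores)
```

Because the cases and checks live outside any one model, switching models is a one-line change in what you pass to `run_suite`.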

The Invitation

The leaderboard is a starting point. It narrows the field. It tells you which models are worth testing.

Use it for that.

But after you pick one, you have work to do. Build an eval. Look at your traces. Make it binary. Get your number. That number tells you what product you're building, what to improve next, and whether switching models is worth it - or just noise.

The eval suite you build today is knowledge you keep. It's encoded in your workbench. It compounds. Next quarter, when the cost-quality calculus shifts and a new model leads, you don't start over. You run your suite against it. Decide in hours, not weeks.

Build anything with AI. Keep everything. Evolve forever. The eval suite is what you keep. That's not a technical artifact. That's compounding judgment.
