June 25, 2026 · 7 min read · Rishi

Evaluating LLM Applications: Getting Past 'It Looks Good to Me'

Here is how most LLM features are tested: someone tweaks the prompt, tries three or four examples by hand, sees that the output looks reasonable, and ships. This works right up until the day a "small improvement" to the prompt fixes the one case you were looking at and silently breaks ten cases you weren't. You won't find out from your tests — there are no tests. You'll find out from a user, or worse, you won't find out at all and quality just quietly erodes release over release.

The problem is that LLM outputs feel subjective, so teams conclude they can't be tested like normal software. That conclusion is wrong, and acting on it is how you ship a system nobody can confidently change. You can evaluate LLM applications systematically — it just looks different from assertEquals. Here's how to build the discipline.

Why traditional tests don't fit (and what replaces them)

A unit test asserts exact equality: add(2, 2) == 4. LLM outputs break this in two ways. They're non-deterministic: the same input can yield different wordings. And they're open-ended: many different outputs are equally correct. "Paris is the capital of France" and "The capital of France is Paris" are both right, yet an exact-match assertion written against one fails on the other.

So you replace exact-match with graded evaluation against criteria. Instead of "does the output equal this string," you ask "does the output satisfy these properties." The shift is from binary equality to scored rubrics, and once you make it, evaluation becomes tractable again. Everything below is about making that grading reliable and cheap to run.
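To make the contrast concrete, here is a minimal sketch. The `satisfies` helper and its two criteria are illustrative assumptions, not a standard API; real rubrics will have more properties and better ones.

```python
# Two correct answers with different wording.
reference = "Paris is the capital of France"
candidates = [
    "Paris is the capital of France",
    "The capital of France is Paris",
]

def exact_match(output: str, expected: str) -> bool:
    """The unit-test style check: passes only on an identical string."""
    return output == expected

def satisfies(output: str, required_terms: list[str], max_chars: int) -> dict[str, bool]:
    """Hypothetical property check: one pass/fail result per criterion."""
    lowered = output.lower()
    return {
        "mentions_required_terms": all(t.lower() in lowered for t in required_terms),
        "within_length": len(output) <= max_chars,
    }

for text in candidates:
    props = satisfies(text, required_terms=["Paris", "France"], max_chars=200)
    # First value: exact match. Second value: all properties satisfied.
    print(exact_match(text, reference), all(props.values()))
```

The second candidate fails the exact match and passes the property check, which is the whole argument in two lines of output.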

Start with a dataset, not a metric

The foundational artifact of LLM evaluation isn't a clever scoring function — it's a dataset of examples. You cannot improve what you can't measure, and you can't measure without cases to measure on. This is the step teams skip, and skipping it is why they're stuck testing by vibes.

Build it deliberately:

Mine real usage. Your production logs are the best source of realistic inputs. Pull a representative sample of actual user queries — they're more honest than anything you'll invent at your desk.
Capture every failure. When a user reports a bad output, or you spot one, add it to the dataset. This is the single most valuable habit in the whole practice. Each captured failure becomes a permanent regression test: if that exact case breaks again, your eval catches it before a user does. Over time your dataset becomes a record of every way your system has ever been wrong.
Cover the edges deliberately. Empty inputs, adversarial prompts, the longest realistic input, multiple languages, ambiguous requests. The cases that break systems are rarely the average ones.

You don't need thousands of examples to start. Fifty to a hundred well-chosen cases catch a surprising fraction of regressions and are enough to make changes with confidence instead of crossing your fingers. The dataset grows naturally as you capture failures.
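One way to keep such a dataset is an append-only JSONL file, one case per line. This is a sketch under assumptions: the `EvalCase` fields, the `source` labels, and the file name are all made up for illustration, so adapt them to whatever your application actually needs to record.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path

@dataclass
class EvalCase:
    case_id: str
    user_input: str
    must_mention: list[str] = field(default_factory=list)
    source: str = "production_log"  # or "reported_failure", "edge_case"
    notes: str = ""

def append_case(path: Path, case: EvalCase) -> None:
    """Append one case as a JSON line, so the dataset only ever grows."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(case), ensure_ascii=False) + "\n")

def load_cases(path: Path) -> list[EvalCase]:
    """Read every non-blank line back into an EvalCase."""
    with path.open(encoding="utf-8") as f:
        return [EvalCase(**json.loads(line)) for line in f if line.strip()]

# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp) / "eval_cases.jsonl"
    append_case(dataset, EvalCase("refund-001", "How do I get a refund?", ["refund"]))
    append_case(dataset, EvalCase("empty-001", "", source="edge_case", notes="empty input"))
    cases = load_cases(dataset)
    print(len(cases), [c.source for c in cases])
```

A plain text file in the repository is enough at this size: it diffs cleanly in review, and adding a reported failure is a one-line change.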

Three kinds of evaluation, in order of preference

Not every check needs an LLM to grade it. Reach for the cheapest, most reliable method that fits each criterion — and reserve the expensive ones for what genuinely needs judgment.

1. Code-based assertions (cheap, deterministic, do these first). A lot of what you care about is checkable with plain code, no AI required:

def check(output, case):
    # is_valid_json and contains_pii are project-specific helpers (not shown)
    assert is_valid_json(output)                      # structural
    assert case["must_mention"] in output             # contains key fact
    assert "as an ai" not in output.lower()           # no boilerplate (lowercase vs. lowercase)
    assert len(output) < 2000                          # length bound
    assert not contains_pii(output)                    # safety

Format validity, required keywords, forbidden phrases, length limits, schema conformance, a regex on a reference number: all of this is fast, free, and deterministic, so it never flakes. Push as much of your evaluation into this bucket as possible. People reach for fancy LLM grading when half their criteria are really just asserts.
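The helpers that a check like this leans on are not defined in the post. Here is a minimal runnable sketch, assuming a deliberately crude PII detector (emails and US-style phone numbers only) and a `run_checks` variant that reports every failed check instead of stopping at the first failing assert. All names here are hypothetical.

```python
import json
import re

def is_valid_json(text: str) -> bool:
    """True if the text parses as JSON."""
    try:
        json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True

# Crude on purpose: a real PII check needs far more than two patterns.
_EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
_PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

def contains_pii(text: str) -> bool:
    return bool(_EMAIL.search(text) or _PHONE.search(text))

def run_checks(output: str, case: dict) -> list[str]:
    """Return the names of all failed checks for one output."""
    checks = {
        "valid_json": is_valid_json(output),
        "mentions_key_fact": case["must_mention"] in output,
        # Compare lowercase to lowercase, or the check can never fire.
        "no_boilerplate": "as an ai" not in output.lower(),
        "length_bound": len(output) < 2000,
        "no_pii": not contains_pii(output),
    }
    return [name for name, passed in checks.items() if not passed]

good = '{"reply": "Your refund was issued to the original card."}'
bad = "As an AI, I cannot help. Email me at agent@example.com"
print(run_checks(good, {"must_mention": "refund"}))
print(run_checks(bad, {"must_mention": "refund"}))
```

Returning a list of failures rather than raising makes the result easy to aggregate across a whole dataset, where you usually want to know which criteria fail most often and not just that something failed.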

2. LLM-as-judge (for genuine quality judgments). For criteria that need understanding — is this answer faithful to the source? helpful? appropriately toned? — use a strong model to grade the output against a rubric:

You are grading a customer-support reply.
Question: {question}
Retrieved context: {context}
Reply: {reply}

Score 1–5 on each, with a one-sentence justification:
- Faithfulness: every claim is supported by the context (no invention)
- Relevance: directly addresses the question
- Tone: professional and warm
Return JSON: {"scores": {...}, "reasons": {...}}

LLM-as-judge is powerful but has sharp edges you must respect. Judges carry biases (favoring longer answers, or answers that resemble their own style), they're non-deterministic themselves, and a vague rubric gives noisy scores. Mitigate in three ways: make rubrics specific (concrete criteria beat "rate the quality"); ask for a justification alongside the score (it tends to improve consistency and lets you audit); and, critically, validate the judge against human labels on a sample. If your automated judge doesn't correlate with human judgment, its scores are theater. Don't trust a judge you haven't checked.
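Checking a judge against human labels can start very simply. The sketch below assumes the judge returns JSON with lowercase criterion keys under "scores" (your judge may format this differently), and it measures plain agreement rather than a chance-corrected statistic such as Cohen's kappa. The score lists are made-up illustrative numbers, not results.

```python
import json

CRITERIA = ("faithfulness", "relevance", "tone")

def parse_judge_reply(raw: str) -> dict[str, int]:
    """Parse the judge's JSON and reject anything outside the 1-5 rubric."""
    scores = json.loads(raw)["scores"]
    parsed = {}
    for name in CRITERIA:
        value = scores[name]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{name}: expected an integer 1-5, got {value!r}")
        parsed[name] = value
    return parsed

def agreement(judge: list[int], human: list[int], tolerance: int = 0) -> float:
    """Fraction of cases where judge and human differ by at most `tolerance`."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two equal-length, non-empty score lists")
    hits = sum(abs(j - h) <= tolerance for j, h in zip(judge, human))
    return hits / len(judge)

raw = '{"scores": {"faithfulness": 5, "relevance": 4, "tone": 3}, "reasons": {}}'
print(parse_judge_reply(raw))

# Illustrative faithfulness scores for six cases graded by both judge and human.
judge_scores = [5, 4, 2, 5, 3, 1]
human_scores = [5, 3, 2, 4, 3, 2]
print(agreement(judge_scores, human_scores))               # exact agreement
print(agreement(judge_scores, human_scores, tolerance=1))  # within one point
```

Low exact agreement with high within-one agreement usually points at a rubric whose adjacent scores aren't clearly separated. Low agreement on both is the signal to fix the judge before trusting any number it produces.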

3. Human evaluation (the gold standard, used sparingly). Humans grading outputs is the most reliable signal and the most expensive. Use it to calibrate your automated judges, to settle cases the automation flags as borderline, and to spot failure modes you didn't think to encode. You can't do it on every change, but you should do it periodically — and definitely use it to bootstrap and audit your LLM-as-judge.
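What counts as "borderline" is yours to define. One possible rule, sketched here with assumed thresholds: run the judge more than once on the same output and send the case to a human when the runs disagree, or when any run lands on the middle score. The function name and both defaults are hypothetical.

```python
def needs_human_review(runs: list[dict[str, int]], max_spread: int = 1,
                       borderline: int = 3) -> bool:
    """runs: rubric scores from repeated judge calls on the same output."""
    if not runs:
        raise ValueError("need at least one judge run")
    for criterion in runs[0]:
        values = [run[criterion] for run in runs]
        if max(values) - min(values) > max_spread:
            return True  # the judge disagrees with itself
        if borderline in values:
            return True  # the judge landed on the fence
    return False

stable = [{"faithfulness": 5, "tone": 4}, {"faithfulness": 5, "tone": 5}]
unstable = [{"faithfulness": 5, "tone": 4}, {"faithfulness": 2, "tone": 4}]
print(needs_human_review(stable), needs_human_review(unstable))
```

This keeps the human queue small: reviewers see only the cases where the automation is least sure of itself.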

Wire it into the loop

An eval you run by hand once a month is barely an eval. The value compounds when it's part of how you change the system:

Run evals on every prompt or model change. Treat the prompt as code: no change ships without the eval suite passing. This is what stops the "fixed one case, broke ten" disaster — you see the ten break, in a diff, before they reach users.
Track scores over time. A dashboard of eval scores per release turns silent quality drift into a visible trend line. Quality erosion that no single change would have flagged becomes obvious in aggregate.
Gate releases on no-regression. Set a floor for each metric. If a change drops faithfulness below its floor, it doesn't ship until you understand why. The eval is the guardrail.
Test model migrations against it. When a new model version comes out, your eval set tells you in an afternoon whether it's actually better for your task — not for some generic benchmark. This is how you upgrade models with confidence instead of hope.
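The gate itself can be a small comparison between two eval runs over the same dataset. This is a sketch, assuming each run is a mapping from case ID to a numeric score; `compare_runs`, the case IDs, and the floor of 4.0 are all illustrative.

```python
def compare_runs(baseline: dict[str, float], candidate: dict[str, float],
                 floor: float, max_drop: float = 0.0) -> dict:
    """Compare per-case scores from two eval runs on their shared cases."""
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("the two runs share no cases")
    regressed = [c for c in shared if candidate[c] < baseline[c] - max_drop]
    improved = [c for c in shared if candidate[c] > baseline[c]]
    mean = sum(candidate[c] for c in shared) / len(shared)
    return {
        "regressed": regressed,
        "improved": improved,
        "candidate_mean": mean,
        # Ship only if no case got worse and the average clears the floor.
        "passes": not regressed and mean >= floor,
    }

baseline = {"refund-001": 5, "empty-001": 4, "lang-fr-001": 4}
candidate = {"refund-001": 5, "empty-001": 2, "lang-fr-001": 5}
report = compare_runs(baseline, candidate, floor=4.0)
print(report["regressed"], report["improved"], report["passes"])
```

Here the candidate's mean still clears the floor, but the gate fails because one case got worse: the "fixed one case, broke another" pattern, visible before release. In CI you would typically exit non-zero when `passes` is false, and with a non-deterministic judge a small `max_drop` may be needed to keep noise from blocking every change.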

The mindset shift

The deeper change isn't a tool or a metric — it's treating evaluation as a first-class part of building with LLMs, on par with writing the prompts themselves. The teams that ship reliable AI features aren't the ones with the cleverest prompts; they're the ones who can measure whether a change made things better or worse, and who built that measurement before they needed it.

Without evals, every change is a gamble and you're flying blind on quality. With them, you can iterate aggressively — try a bold new prompt, swap models, refactor your retrieval — and know, in minutes, whether you improved things or regressed. That confidence is the entire difference between an LLM demo that wows in a meeting and an LLM product that holds up in production. Build the dataset, capture every failure, automate what you can, and put a human in the loop where it counts. The vibes were never going to scale.

