How to Evaluate LLM Quality

Key idea:

LLM eval in 2026 rests on four layers:

  1. Automatic metrics — Ragas (answer_relevancy, faithfulness) for RAG, BLEU / ROUGE for translation and summarization, Pass@K for code
  2. LLM-as-judge — GPT-5 evaluates another LLM's output against a rubric
  3. Human eval — the ground truth for production-critical paths
  4. Continuous — CI/CD runs the eval suite on every prompt change

Tools: LangSmith, LangFuse, Weights & Biases, Braintrust.

Below: step-by-step, working examples, common pitfalls, FAQ.
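Pass@K, mentioned above as the standard code-generation metric, has a widely used unbiased estimator: given n generated samples per problem of which c pass the tests, it gives the probability that at least one of k randomly drawn samples passes. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimator: n samples total, c of them correct, k drawn."""
    if n - c < k:
        # fewer incorrect samples than draws: at least one correct is guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

In practice you generate n > k samples per problem and average `pass_at_k` over the whole problem set, which gives a much lower-variance number than literally drawing k samples.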

Try it now — free →

Step-by-Step Setup

  1. Collect a test set of 50-500 representative examples with expected outputs
  2. Pick metrics: factual (accuracy), stylistic (tone), safety (toxicity)
  3. Use Ragas for RAG pipelines: answer_relevancy, context_precision, faithfulness
  4. Add an LLM-as-judge: GPT-5 rates outputs 1-5 against a rubric
  5. Integrate with CI: run the eval on every PR and fail the build if the score drops
  6. Monitor production: sample 1% of traffic for real-time eval
  7. Human review: a weekly audit of 50 random calls
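The steps above can be condensed into a minimal harness. Everything here is a hypothetical sketch: `grade` uses exact match as a placeholder metric (swap in Ragas or an LLM judge), and `BASELINE` stands in for whatever baseline score your CI stores.

```python
BASELINE = 0.80  # assumed stored baseline; fail CI when the suite drops below it

def grade(expected: str, actual: str) -> float:
    # simplest possible metric: normalized exact match (placeholder)
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0

def run_suite(test_set, predict):
    """test_set: list of (question, expected) pairs; predict: question -> answer."""
    scores = [grade(expected, predict(question)) for question, expected in test_set]
    return sum(scores) / len(scores)

def ci_gate(score: float, baseline: float = BASELINE) -> None:
    # non-zero exit fails the PR build when quality regresses
    if score < baseline:
        raise SystemExit(f"eval score {score:.2f} below baseline {baseline:.2f}")
```

The same loop scales from a 50-example smoke test to the full suite; only `test_set` and the metric inside `grade` change.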

Working Examples

Ragas RAG evaluation

```python
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

# dataset columns: question, answer, contexts, ground_truth
results = evaluate(
    dataset,
    metrics=[answer_relevancy, context_precision, faithfulness],
)
print(results)  # {'answer_relevancy': 0.87, 'faithfulness': 0.92, ...}
```

LLM-as-judge prompt

```python
from openai import AsyncOpenAI

client = AsyncOpenAI()

judge_prompt = f"""Rate the answer 1-5 on accuracy and completeness.
Question: {q}
Expected: {gt}
Actual: {out}
Return the score plus a one-sentence reason as JSON."""

judgment = await client.chat.completions.create(
    model="gpt-5",
    messages=[{"role": "user", "content": judge_prompt}],
)
```

LangSmith tracing

```python
from langsmith import Client, trace

client = Client()

# every call inside the context manager is traced automatically
with trace(name="rag_pipeline") as run:
    answer = rag_chain.invoke(query)

# attach an eval score to the recorded run
client.create_feedback(run.id, key="accuracy", score=0.9)
```

CI eval suite (Promptfoo)

```yaml
# promptfoo.yaml
prompts: [promptA.txt, promptB.txt]
providers: [openai:gpt-5, anthropic:claude-opus-4-7]
tests:
  - vars: { question: 'TCP vs UDP' }
    assert:
      - type: contains
        value: 'reliable'
      - type: llm-rubric
        value: 'clearly explains connection-oriented'
```

Production sampling

```javascript
// shadow-evaluate a 1% sample of live traffic
if (Math.random() < 0.01) {
  const judgeScore = await llmJudge(userQuery, actualResponse);
  metrics.gauge('llm_quality', judgeScore);
}
```

Common Pitfalls

  • Test set too small (<20 examples) — metrics are unreliable. Use at least 50-100 examples
  • Metrics without human correlation — automatic scores do not always align with real quality. Schedule periodic human review
  • LLM-as-judge using the same model it is evaluating → self-preference bias. Use a larger or different model
  • Not tracking cost — a full eval run can burn $100+ in API spend. Set a budget limit
  • Static test set grows stale — keep adding examples from production edge cases
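The cost pitfall is easy to guard against mechanically. A hedged sketch of a per-run budget guard — the prices below are illustrative assumptions, not real rates; substitute your provider's current pricing:

```python
# Assumed illustrative prices (USD per 1K tokens) — NOT real provider rates
PRICE_PER_1K_INPUT = 0.005
PRICE_PER_1K_OUTPUT = 0.015

class BudgetGuard:
    """Accumulates estimated spend per eval run and aborts past a hard limit."""

    def __init__(self, limit_usd: float):
        self.limit = limit_usd
        self.spent = 0.0

    def charge(self, input_tokens: int, output_tokens: int) -> None:
        self.spent += (input_tokens / 1000) * PRICE_PER_1K_INPUT
        self.spent += (output_tokens / 1000) * PRICE_PER_1K_OUTPUT
        if self.spent > self.limit:
            raise RuntimeError(f"eval budget exceeded: ${self.spent:.2f}")
```

Call `charge()` after every model call in the eval loop (token counts come back in the API usage field); a raised error stops the run before the bill grows.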


Frequently Asked Questions

How often should I eval?

Run the full suite on every prompt or model change. Run a smaller smoke test on a daily cron. Do a weekly human review of a sample.

LangSmith vs LangFuse?

LangSmith: maintained by the LangChain team, with the tightest LangChain integration. LangFuse: open-source, self-hosting possible. Braintrust: evaluation-first by design.

Automatic vs human eval?

Automatic — scale and consistency, but misses nuance. Human — expensive ($1-5 per example) but the ground truth. Combine them.
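"Combine them" concretely means checking how well automatic scores track human labels. A small sketch: compute the Pearson correlation between judge scores and human audit scores on the same examples — the score lists below are made-up illustration data.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

judge = [4, 5, 2, 3, 5, 1]   # LLM-as-judge scores (made up)
human = [4, 4, 2, 3, 5, 2]   # human audit scores on the same calls (made up)
print(f"judge-human correlation: {pearson(judge, human):.2f}")  # prints 0.94
```

If the correlation drifts down over time, the judge rubric needs recalibration against fresh human labels before its scores can be trusted as a proxy.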

What does Enterno use?

For internal RAG — Ragas, run weekly. For prompt changes — Promptfoo in CI. For production quality — LLM-as-judge on a 1% traffic sample.