LLM eval 2026:
1. Automatic metrics — Ragas (answer_relevancy, faithfulness), BLEU / ROUGE for translation, Pass@K for code.
2. LLM-as-judge — GPT-5 evaluates another LLM's output against a rubric.
3. Human eval — ground truth for production-critical paths.
4. Continuous — CI/CD runs the eval suite on every prompt change.

Tools: LangSmith, LangFuse, Weights & Biases, Braintrust.
Below: step-by-step, working examples, common pitfalls, FAQ.
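Pass@K, mentioned above for code eval, is usually computed with the standard unbiased estimator: given n samples per problem of which c pass, estimate the probability that at least one of k random draws passes. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimator: n generated samples, c of them correct."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so any k draws contain a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(10, 3, 5))  # 1 - C(7,5)/C(10,5) ≈ 0.9167
```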
**Ragas RAG evaluation**

```python
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

# dataset columns: question, answer, contexts, ground_truth
results = evaluate(
    dataset,
    metrics=[answer_relevancy, context_precision, faithfulness],
)
print(results)  # {'answer_relevancy': 0.87, 'faithfulness': 0.92, ...}
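The dataset Ragas consumes is column-aligned with exactly those four fields; a minimal sketch of the shape (sample row is illustrative — wrap the dict with `datasets.Dataset.from_dict` before calling `evaluate`):

```python
# One aligned list per column; `contexts` holds the retrieved chunks per question.
eval_rows = {
    "question": ["What port does HTTPS use?"],
    "answer": ["HTTPS uses port 443 by default."],
    "contexts": [["HTTPS traffic is served on TCP port 443."]],
    "ground_truth": ["Port 443."],
}

# Every column must have the same number of rows.
assert len({len(col) for col in eval_rows.values()}) == 1
```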
**LLM-as-judge prompt**

```python
from openai import AsyncOpenAI

client = AsyncOpenAI()
judge_prompt = f"""Rate the answer 1-5 on accuracy + completeness.
Question: {q}
Expected: {gt}
Actual: {out}
Score + 1-sentence reason in JSON."""
judgment = await client.chat.completions.create(
    model='gpt-5', messages=[{'role': 'user', 'content': judge_prompt}]
)
```
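Judge models often wrap the requested JSON in prose or code fences; a defensive parser (hypothetical helper, not part of any SDK) avoids silently dropping scores:

```python
import json
import re

def parse_judgment(raw: str) -> dict:
    """Extract the first JSON object from a judge response, fences and all."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if not match:
        raise ValueError(f"no JSON object in judge output: {raw!r}")
    return json.loads(match.group(0))

print(parse_judgment('Sure! ```json\n{"score": 4, "reason": "mostly accurate"}\n```'))
# {'score': 4, 'reason': 'mostly accurate'}
```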
**LangSmith tracing**

```python
from langsmith import Client, trace

client = Client()
# One trace per call; attach feedback to the resulting run
with trace(name='rag_pipeline') as run:
    answer = rag_chain.invoke(query)
client.create_feedback(run_id=run.id, key='accuracy', score=0.9)
```
**CI eval suite (Promptfoo)**

```yaml
# promptfoo.yaml
prompts: [promptA.txt, promptB.txt]
providers: [openai:gpt-5, anthropic:claude-opus-4-7]
tests:
  - vars: { question: 'TCP vs UDP' }
    assert:
      - type: contains
        value: 'reliable'
      - type: llm-rubric
        value: 'clearly explains connection-oriented'
```
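A `contains` assertion reduces to a substring check on the model output; a sketch of that pass/fail logic (illustrative, not promptfoo's actual code — promptfoo also offers `icontains` for the case-insensitive variant):

```python
def run_contains(output: str, value: str, ignore_case: bool = False) -> bool:
    """Sketch of a contains-style assertion: does the output include the value?"""
    if ignore_case:
        return value.lower() in output.lower()
    return value in output

print(run_contains("TCP is reliable and connection-oriented", "reliable"))  # True
```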
**Production sampling**

```javascript
// 1% traffic shadow eval
if (Math.random() < 0.01) {
  const judgeScore = await llmJudge(userQuery, actualResponse);
  metrics.gauge('llm_quality', judgeScore);
}
```
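`Math.random()` sampling lets the same request flip in and out of the shadow eval on retries; hashing a stable identifier keeps the 1% cohort deterministic. A sketch, assuming a per-request `request_id` is available:

```python
import hashlib

def in_shadow_sample(request_id: str, rate: float = 0.01) -> bool:
    """Deterministically place ~rate of requests into the shadow-eval cohort."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate

print(in_shadow_sample("req-12345"))
```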
**Cadence:** full suite on every prompt / model change; daily cron for a smaller smoke test; weekly human review of a sample.
**Tooling:** LangSmith — LangChain-maintained, tight integration. LangFuse — open-source, self-hostable. Braintrust — evaluation-first.
**Automatic vs human:** automatic gives scale and consistency but misses nuance; human is expensive ($1-5 per example) but is the ground truth. Combine them.
**Recommendations:** for internal RAG — Ragas weekly. For prompt changes — Promptfoo in CI. For production quality — LLM-as-judge on a 1% traffic sample.
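Tying the pieces together: the CI step can gate on the metric dict that the eval suite emits, failing the build when any score drops below its floor (sketch; the threshold values are assumptions):

```python
def gate(scores: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return the metrics that fall below their CI threshold (empty = pass)."""
    return [m for m, floor in thresholds.items() if scores.get(m, 0.0) < floor]

failures = gate(
    {"answer_relevancy": 0.87, "faithfulness": 0.92},
    {"answer_relevancy": 0.80, "faithfulness": 0.90},
)
assert not failures, f"eval regression: {failures}"  # CI fails on regressions
```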