LLM eval 2026: (1) Automatic metrics: Ragas (answer_relevancy, faithfulness), BLEU / ROUGE for translation, Pass@K for code. (2) LLM-as-judge: GPT-5 scores another LLM's output against a rubric. (3) Human eval: ground truth for production-critical paths. (4) Continuous: CI/CD runs the eval suite on every prompt change. Tools: LangSmith, LangFuse, Weights & Biases, Braintrust.
Below: step-by-step instructions, working examples, common pitfalls, FAQ.
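Pass@K (mentioned above for code eval) has a standard unbiased estimator: with n samples and c correct, pass@k = 1 - C(n-c, k) / C(n, k). A minimal dependency-free sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: n generated samples, c of them passing."""
    if n - c < k:
        # Fewer failures than k draws: at least one success is guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(10, 3, 1))  # probability one random sample passes
```

Generate n samples per problem, count passes with your test harness, then average `pass_at_k` over problems.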
**Ragas RAG evaluation**

```python
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

# dataset columns: {question, answer, contexts, ground_truth}
results = evaluate(
    dataset,
    metrics=[answer_relevancy, context_precision, faithfulness],
)
print(results)  # {'answer_relevancy': 0.87, 'faithfulness': 0.92, ...}
```

**LLM-as-judge prompt**

```python
from openai import AsyncOpenAI

client = AsyncOpenAI()

judge_prompt = f"""Rate the answer 1-5 on accuracy and completeness.
Question: {q}
Expected: {gt}
Actual: {out}
Return the score plus a one-sentence reason as JSON."""

judgment = await client.chat.completions.create(
    model="gpt-5",
    messages=[{"role": "user", "content": judge_prompt}],
)
```

**LangSmith tracing**

```python
from langsmith import Client, trace

client = Client()
# One trace per call; feedback is attached to the run afterwards
with trace(name="rag_pipeline") as run:
    answer = rag_chain.invoke(query)
client.create_feedback(run_id=run.id, key="accuracy", score=0.9)
```

**CI eval suite (Promptfoo)**

```yaml
# promptfoo.yaml
prompts: [promptA.txt, promptB.txt]
providers: [openai:gpt-5, anthropic:claude-opus-4-7]
tests:
  - vars: { question: 'TCP vs UDP' }
    assert:
      - type: contains
        value: 'reliable'
      - type: llm-rubric
        value: 'clearly explains connection-oriented'
```

**Production sampling**

```javascript
// Shadow-evaluate 1% of production traffic
if (Math.random() < 0.01) {
  const judgeScore = await llmJudge(userQuery, actualResponse);
  metrics.gauge('llm_quality', judgeScore);
}
```
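ROUGE-L, listed among the automatic metrics above, reduces to a longest-common-subsequence computation over tokens; a dependency-free sketch (function name is illustrative):

```python
def rouge_l_f1(candidate: list[str], reference: list[str]) -> float:
    """ROUGE-L F1: LCS length over candidate/reference token lists."""
    m, n = len(candidate), len(reference)
    # Dynamic-programming table for LCS length
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if candidate[i] == reference[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / m, lcs / n
    return 2 * precision * recall / (precision + recall)

print(rouge_l_f1("the cat sat".split(), "the cat sat on the mat".split()))
```

For production use, a maintained implementation (e.g. the `rouge-score` package) handles stemming and multi-reference cases.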
Run the full suite on every prompt or model change. A daily cron runs a smaller smoke test. Weekly: human review of a sample.
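The cadence above can be wired into CI; a hypothetical GitHub Actions sketch (file layout, workflow name, and prompt paths are assumptions) that runs the Promptfoo suite on prompt changes and as a daily smoke test:

```yaml
# .github/workflows/llm-eval.yml (assumed layout)
name: llm-eval
on:
  pull_request:
    paths: ['prompts/**', 'promptfoo.yaml']
  schedule:
    - cron: '0 6 * * *'   # daily smoke test
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npx promptfoo@latest eval -c promptfoo.yaml
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```

For the smoke-test path you would typically point the scheduled run at a smaller config than the PR run.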
LangSmith: maintained by the LangChain team, tight LangChain integration. LangFuse: open-source, self-hostable. Braintrust: evaluation-first.
Automatic metrics scale and are consistent but miss nuance. Human eval is expensive ($1-5 per example) but gives ground truth. Combine both.
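One way to combine the two is to spend the human budget where automatic scores are least trustworthy: stratified sampling that oversamples low judge scores. A hypothetical sketch (function name, record shape, and the 60/40 split are illustrative):

```python
import random

def sample_for_human_review(records, judge_key="judge_score",
                            n=50, low_frac=0.6, seed=None):
    """Pick n records for human review, oversampling low judge scores."""
    rng = random.Random(seed)
    ranked = sorted(records, key=lambda r: r[judge_key])
    n_low = min(int(n * low_frac), len(ranked))
    low = ranked[:n_low]                     # worst-scored records, always included
    remaining = ranked[n_low:]
    rest = rng.sample(remaining, min(n - n_low, len(remaining)))
    return low + rest

batch = sample_for_human_review(
    [{"judge_score": s} for s in range(100)], n=10, seed=0)
print(len(batch))  # 10
```

The low-score bucket is deterministic, so regressions surface in every weekly batch; the random remainder keeps an unbiased view of overall quality.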
For internal RAG: Ragas weekly. For prompt changes: Promptfoo in CI. For production quality: LLM-as-judge sampling on 1% of traffic.