
How to evaluate LLM quality

In short:

LLM eval 2026: (1) Automatic metrics: Ragas (answer_relevancy, faithfulness), BLEU / ROUGE for translation, Pass@K for code. (2) LLM-as-judge: GPT-5 scores another LLM's output against a rubric. (3) Human eval: the ground truth for production-critical flows. (4) Continuous: CI/CD runs the eval suite on every prompt change. Tools: LangSmith, LangFuse, Weights & Biases, Braintrust.

Below: a step-by-step guide, working examples, common mistakes, and an FAQ.


Step-by-step setup

  1. Build a test set of 50-500 representative examples with expected outputs
  2. Pick metrics: factual (accuracy), stylistic (tone), safety (toxicity)
  3. Ragas for RAG: answer_relevancy, context_precision, faithfulness
  4. LLM-as-judge: GPT-5 with a rubric scores outputs 1-5 for quality
  5. CI integration: run the eval on every PR, fail if the score drops
  6. Monitor production: sample 1% of traffic, real-time eval
  7. Human review: weekly audit of 50 random calls
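The steps above can be sketched as a minimal eval harness: a small test set with expected outputs (step 1) plus a threshold gate you can run in CI (step 5). The `run_model` stub below is a hypothetical placeholder for a real LLM call.

```python
from dataclasses import dataclass

@dataclass
class Example:
    question: str
    expected_substring: str  # simplest possible "contains" assertion

def run_model(question: str) -> str:
    # Hypothetical stub: replace with your real LLM call.
    canned = {
        "TCP vs UDP": "TCP is reliable and connection-oriented; UDP is best-effort.",
        "What does DNS do?": "DNS resolves domain names to IP addresses.",
    }
    return canned.get(question, "")

def run_suite(examples, threshold=0.8):
    passed = sum(
        ex.expected_substring.lower() in run_model(ex.question).lower()
        for ex in examples
    )
    score = passed / len(examples)
    # Fail the CI job when quality drops below the threshold.
    assert score >= threshold, f"eval score {score:.2f} < {threshold}"
    return score

test_set = [
    Example("TCP vs UDP", "reliable"),
    Example("What does DNS do?", "domain"),
]
```

Substring matching is deliberately crude; in practice you would swap it for Ragas metrics or an LLM judge, keeping the same pass/fail gate.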

Working examples

Ragas RAG evaluation:

```python
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

# dataset columns: question, answer, contexts, ground_truth
results = evaluate(
    dataset,
    metrics=[answer_relevancy, context_precision, faithfulness],
)
print(results)  # {'answer_relevancy': 0.87, 'faithfulness': 0.92, ...}
```

LLM-as-judge prompt:

```python
judge_prompt = f"""Rate the answer 1-5 on accuracy and completeness.
Question: {q}
Expected: {gt}
Actual: {out}
Return the score and a one-sentence reason as JSON."""

judgment = await openai.chat.completions.create(
    model='gpt-5',
    messages=[{'role': 'user', 'content': judge_prompt}],
)
```

LangSmith tracing:

```python
from langsmith import Client, trace

client = Client()

# Trace the pipeline call, then attach a feedback score to the run
with trace(name='rag_pipeline'):
    answer = rag_chain.invoke(query)
client.create_feedback(run_id=..., key='accuracy', score=0.9)
```

CI eval suite (Promptfoo):

```yaml
# promptfoo.yaml
prompts: [promptA.txt, promptB.txt]
providers: [openai:gpt-5, anthropic:claude-opus-4-7]
tests:
  - vars: { question: 'TCP vs UDP' }
    assert:
      - type: contains
        value: 'reliable'
      - type: llm-rubric
        value: 'clearly explains connection-oriented'
```

Production sampling:

```javascript
// Shadow-eval 1% of traffic
if (Math.random() < 0.01) {
  const judgeScore = await llmJudge(userQuery, actualResponse);
  metrics.gauge('llm_quality', judgeScore);
}
```

Common mistakes

  • Test set too small (<20): metrics are unreliable. Aim for at least 50-100 examples
  • Metrics without human correlation: automatic scores do not always align with real quality. Do periodic human review
  • LLM-as-judge using the same model it is evaluating: bias. Use a bigger or a different model
  • Not tracking cost: an eval run can burn $100+ in cloud spend. Set a budget limit
  • A static test set goes stale: keep adding examples from production edge cases
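The cost pitfall can be caught with a simple budget tracker that aborts the eval run when spend exceeds a limit. A minimal sketch; the per-token prices and token counts below are illustrative assumptions, not real rates.

```python
class CostBudget:
    """Accumulates estimated eval spend and raises once a limit is exceeded."""

    def __init__(self, limit_usd: float):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def charge(self, prompt_tokens: int, completion_tokens: int,
               usd_per_1k_prompt: float, usd_per_1k_completion: float) -> None:
        self.spent_usd += (prompt_tokens / 1000) * usd_per_1k_prompt
        self.spent_usd += (completion_tokens / 1000) * usd_per_1k_completion
        if self.spent_usd > self.limit_usd:
            raise RuntimeError(
                f"eval budget exceeded: ${self.spent_usd:.2f} > ${self.limit_usd:.2f}"
            )

# Charge one hypothetical judged call against a $20 eval budget.
budget = CostBudget(limit_usd=20.0)
budget.charge(prompt_tokens=1200, completion_tokens=400,
              usd_per_1k_prompt=0.005, usd_per_1k_completion=0.015)
```

Call `charge` after every model invocation in the suite; the exception fails the run early instead of silently burning through the budget.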


Frequently asked questions

How often should you run evals?

On every prompt or model change: the full suite. Daily cron: a smaller smoke test. Weekly: human review of a sample.

LangSmith vs LangFuse?

LangSmith: maintained by the LangChain team, tightest integration with LangChain. LangFuse: open source, can be self-hosted. Braintrust: evaluation-first platform.

Automatic vs human eval?

Automatic: scales and stays consistent, but misses nuance. Human: expensive ($1-5 per example) but is the ground truth. Combine both.

What does Enterno use?

For internal RAG: Ragas, weekly. For prompt changes: Promptfoo in CI. For production quality: LLM-as-judge sampling of 1% of traffic.