
How to evaluate LLM quality

In short:

LLM eval 2026: (1) Automatic metrics: Ragas (answer_relevancy, faithfulness), BLEU / ROUGE for translation, Pass@K for code. (2) LLM-as-judge: GPT-5 scores another LLM's output against a rubric. (3) Human eval: the ground truth for production-critical flows. (4) Continuous: CI/CD runs the eval suite on every prompt change. Tools: LangSmith, LangFuse, Weights & Biases, Braintrust.

Below: a step-by-step guide, working examples, common mistakes, and an FAQ.


Step-by-step setup

  1. Build a test set of 50-500 representative examples with expected outputs
  2. Pick metrics: factual (accuracy), stylistic (tone), safety (toxicity)
  3. Ragas for RAG: answer_relevancy, context_precision, faithfulness
  4. LLM-as-judge: GPT-5 with a rubric scores outputs 1-5 for quality
  5. CI integration: run the eval on every PR, fail if the score drops
  6. Monitor production: sample 1% of traffic, real-time eval
  7. Human review: weekly audit of 50 random calls
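The steps above can be sketched as a minimal eval harness: a small test set with expected outputs (step 1) plus a threshold gate you can run in CI (step 5). The `run_model` stub below is a hypothetical placeholder for a real LLM call.

```python
from dataclasses import dataclass

@dataclass
class Example:
    question: str
    expected_substring: str  # simplest possible "contains" assertion

def run_model(question: str) -> str:
    # Hypothetical stub: replace with your real LLM call.
    canned = {
        "TCP vs UDP": "TCP is reliable and connection-oriented; UDP is best-effort.",
        "What does DNS do?": "DNS resolves domain names to IP addresses.",
    }
    return canned.get(question, "")

def run_suite(examples, threshold=0.8):
    passed = sum(
        ex.expected_substring.lower() in run_model(ex.question).lower()
        for ex in examples
    )
    score = passed / len(examples)
    # Fail the CI job when quality drops below the threshold.
    assert score >= threshold, f"eval score {score:.2f} < {threshold}"
    return score

test_set = [
    Example("TCP vs UDP", "reliable"),
    Example("What does DNS do?", "domain"),
]
```

Substring matching is deliberately crude; in practice you would swap it for Ragas metrics or an LLM judge, keeping the same pass/fail gate.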

Working examples

Ragas RAG evaluation:

```python
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

# dataset columns: question, answer, contexts, ground_truth
results = evaluate(
    dataset,
    metrics=[answer_relevancy, context_precision, faithfulness],
)
print(results)  # {'answer_relevancy': 0.87, 'faithfulness': 0.92, ...}
```

LLM-as-judge prompt:

```python
judge_prompt = f"""Rate the answer 1-5 on accuracy and completeness.
Question: {q}
Expected: {gt}
Actual: {out}
Return the score and a one-sentence reason as JSON."""

judgment = await openai.chat.completions.create(
    model='gpt-5',
    messages=[{'role': 'user', 'content': judge_prompt}],
)
```

LangSmith tracing:

```python
from langsmith import Client, trace

client = Client()

# Trace the pipeline call, then attach a feedback score to the run
with trace(name='rag_pipeline'):
    answer = rag_chain.invoke(query)
client.create_feedback(run_id=..., key='accuracy', score=0.9)
```

CI eval suite (Promptfoo):

```yaml
# promptfoo.yaml
prompts: [promptA.txt, promptB.txt]
providers: [openai:gpt-5, anthropic:claude-opus-4-7]
tests:
  - vars: { question: 'TCP vs UDP' }
    assert:
      - type: contains
        value: 'reliable'
      - type: llm-rubric
        value: 'clearly explains connection-oriented'
```

Production sampling:

```javascript
// Shadow-eval 1% of traffic
if (Math.random() < 0.01) {
  const judgeScore = await llmJudge(userQuery, actualResponse);
  metrics.gauge('llm_quality', judgeScore);
}
```

Common mistakes

  • Test set too small (<20): metrics are unreliable. Aim for at least 50-100 examples
  • Metrics without human correlation: automatic scores do not always align with real quality. Do periodic human review
  • LLM-as-judge using the same model it is evaluating: bias. Use a bigger or a different model
  • Not tracking cost: an eval run can burn $100+ in cloud spend. Set a budget limit
  • A static test set goes stale: keep adding examples from production edge cases
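The cost pitfall can be caught with a simple budget tracker that aborts the eval run when spend exceeds a limit. A minimal sketch; the per-token prices and token counts below are illustrative assumptions, not real rates.

```python
class CostBudget:
    """Accumulates estimated eval spend and raises once a limit is exceeded."""

    def __init__(self, limit_usd: float):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def charge(self, prompt_tokens: int, completion_tokens: int,
               usd_per_1k_prompt: float, usd_per_1k_completion: float) -> None:
        self.spent_usd += (prompt_tokens / 1000) * usd_per_1k_prompt
        self.spent_usd += (completion_tokens / 1000) * usd_per_1k_completion
        if self.spent_usd > self.limit_usd:
            raise RuntimeError(
                f"eval budget exceeded: ${self.spent_usd:.2f} > ${self.limit_usd:.2f}"
            )

# Charge one hypothetical judged call against a $20 eval budget.
budget = CostBudget(limit_usd=20.0)
budget.charge(prompt_tokens=1200, completion_tokens=400,
              usd_per_1k_prompt=0.005, usd_per_1k_completion=0.015)
```

Call `charge` after every model invocation in the suite; the exception fails the run early instead of silently burning through the budget.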


Frequently asked questions

How often should you run evals?

On every prompt or model change: the full suite. Daily cron: a smaller smoke test. Weekly: human review of a sample.

LangSmith vs LangFuse?

LangSmith: maintained by the LangChain team, tightest integration with LangChain. LangFuse: open source, can be self-hosted. Braintrust: evaluation-first platform.

Automatic vs human eval?

Automatic: scales and stays consistent, but misses nuance. Human: expensive ($1-5 per example) but is the ground truth. Combine both.

What does Enterno use?

For internal RAG: Ragas, weekly. For prompt changes: Promptfoo in CI. For production quality: LLM-as-judge sampling of 1% of traffic.