How to Fine-tune LLM

Key idea:

Fine-tuning in 2026, in brief: (1) prepare 100-10k examples in JSONL; (2) pick a platform — OpenAI (gpt-4o-mini FT, $3/1M training tokens), Together.ai (Llama 3 70B LoRA, ~$5-20 per run), or self-host via Axolotl/Unsloth; (3) upload the dataset and start the job (1-10 hours); (4) evaluate on a held-out test set; (5) deploy — OpenAI creates an endpoint automatically, Together returns an API. When NOT to fine-tune: when RAG + prompt engineering already solve the task.

Below: step-by-step, working examples, common pitfalls, FAQ.

Step-by-Step Setup

  1. Collect 100+ quality examples in JSONL format
  2. Validation split: 80% train / 20% eval
  3. OpenAI: openai api fine_tuning.jobs.create -t file-X -m gpt-4o-mini
  4. Together.ai: upload via CLI, config LoRA (rank=16, alpha=32)
  5. Monitor loss curve — stop if overfitting (eval loss rises)
  6. Eval on test set — accuracy / BLEU / manual grading
  7. Deploy: OpenAI → auto endpoint. Together → API key
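Steps 1-2 above can be sketched in a few lines of Python. This is a minimal example assuming your raw data is a list of (question, answer) pairs; the pairs and the system prompt here are placeholders to replace with your own:

```python
import json
import random

# Placeholder data — substitute your real (question, answer) pairs.
pairs = [(f"question {i}", f"answer {i}") for i in range(100)]

def to_record(question, answer):
    # One chat-format training record (the JSONL shape shown below).
    return {"messages": [
        {"role": "system", "content": "You are a customer support bot."},
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
    ]}

random.seed(0)
random.shuffle(pairs)
split = int(len(pairs) * 0.8)          # 80% train / 20% eval
train, eval_ = pairs[:split], pairs[split:]

# JSONL = one JSON object per line.
for name, subset in [("train.jsonl", train), ("eval.jsonl", eval_)]:
    with open(name, "w") as f:
        for q, a in subset:
            f.write(json.dumps(to_record(q, a)) + "\n")
```

Shuffling before the split matters: if your examples are grouped by topic, an unshuffled split gives a misleading eval set.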

Working Examples

OpenAI JSONL format (one record per line; pretty-printed here for readability):

```json
{"messages": [
  {"role": "system", "content": "You are a customer support bot for Enterno."},
  {"role": "user", "content": "Where is my invoice?"},
  {"role": "assistant", "content": "You can find invoices at /dashboard → Billing → History."}
]}
```

QLoRA locally (Unsloth):

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer  # SFTTrainer comes from the trl package

model, tokenizer = FastLanguageModel.from_pretrained("unsloth/llama-3-8b-bnb-4bit")
model = FastLanguageModel.get_peft_model(
    model, r=16, target_modules=["q_proj", "k_proj", "v_proj"]
)
trainer = SFTTrainer(model=model, train_dataset=ds, max_seq_length=2048)
trainer.train()
```

Together.ai CLI:

```shell
$ together files upload train.jsonl
$ together fine-tuning create \
    --training-file FILE_ID \
    --model meta-llama/Meta-Llama-3.1-70B-Instruct-Reference \
    --lora --lora-r 16 --lora-alpha 32
```

Inference after FT (OpenAI):

```python
resp = client.chat.completions.create(
    model="ft:gpt-4o-mini-2024:myorg::abc",
    messages=[...],
)
```

Eval with Ragas:

```python
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

results = evaluate(dataset, metrics=[answer_relevancy, faithfulness])
```

Common Pitfalls

  • Do not start with FT — first try prompt engineering + RAG. 80% of cases are solved without FT
  • Dataset too small (<50 examples) — the model overfits and fails to learn the general pattern
  • Inconsistent format across examples — model gets confused
  • Training without validation set → you miss overfitting
  • FT changes weights — base model knowledge can degrade ("catastrophic forgetting")

Frequently Asked Questions

RAG or FT?

RAG: dynamic knowledge, easy update. FT: style, tone, format consistency. Often combined — FT for tone + RAG for facts.

Cost?

OpenAI gpt-4o-mini FT: $3/1M training tokens. Together Llama 3 70B LoRA: ~$5-20 per run. Self-host: $0 if you have a GPU.
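The OpenAI figure translates into a quick back-of-envelope calculation. This sketch assumes billed tokens ≈ dataset tokens × epochs; the 3-epoch default and the 500-token-per-example estimate are illustrative assumptions:

```python
def training_cost_usd(dataset_tokens, epochs=3, price_per_1m=3.0):
    # Cost = total billed tokens / 1M * price per 1M tokens.
    return dataset_tokens * epochs / 1_000_000 * price_per_1m

# e.g. 1,000 examples x ~500 tokens each = 500k dataset tokens, 3 epochs:
cost = training_cost_usd(500_000)  # -> 4.5 (USD)
```

In other words, a small fine-tune on gpt-4o-mini typically costs single-digit dollars; the dominant cost driver is epochs × dataset size, not example count alone.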

How to measure improvement?

Held-out test set (20%). Metrics depend on task: exact match, BLEU, LLM-as-judge (GPT-4 grades outputs).
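For tasks with short, unambiguous answers, exact match is the simplest of these metrics. A sketch; `preds` stands in for your fine-tuned model's outputs on the held-out set:

```python
def exact_match(preds, refs):
    """Fraction of predictions equal to references (case/whitespace-insensitive)."""
    assert len(preds) == len(refs)
    hits = sum(p.strip().lower() == r.strip().lower() for p, r in zip(preds, refs))
    return hits / len(refs)

score = exact_match(["Paris", "42 "], ["paris", "42"])  # -> 1.0
```

For open-ended outputs where string equality is meaningless, swap this scorer for BLEU or an LLM-as-judge prompt; the held-out-set discipline stays the same.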

LoRA vs full FT?

LoRA: 0.1-1% params updated, fast, cheap. Full FT: all params, best quality but 10-100x cost. For 95% of use cases LoRA is enough.
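The 0.1-1% figure is easy to verify for a single weight matrix: LoRA trains two low-rank factors A (r × d_in) and B (d_out × r) instead of the full d_out × d_in matrix. A sketch with illustrative dimensions (4096 is a typical attention-projection size in Llama-class models):

```python
def lora_fraction(d_in, d_out, r=16):
    # Full fine-tuning updates d_in * d_out params per matrix;
    # LoRA updates only r * (d_in + d_out).
    full = d_in * d_out
    lora = r * (d_in + d_out)
    return lora / full

frac = lora_fraction(4096, 4096)  # -> 0.0078125, i.e. ~0.78%
```

With r=16 on a 4096×4096 projection, LoRA trains under 1% of that matrix's parameters, which is exactly why runs are fast and cheap.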