Fine-tuning 2026, in five steps:

1. Prepare 100-10k examples in JSONL.
2. Pick a platform: OpenAI (gpt-4o-mini FT, $3/1M training tokens), Together.ai (Llama 3 70B LoRA, $5-20 per run), or self-host via Axolotl/Unsloth.
3. Upload the dataset and start the job (1-10 hours).
4. Evaluate on a held-out test set.
5. Deploy: OpenAI creates an endpoint, Together returns an API.

When NOT to fine-tune: when RAG + prompt engineering already solve the task.
Below: step-by-step, working examples, common pitfalls, FAQ.
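Steps 1-3 can also be driven from the OpenAI Python SDK. A minimal sketch, assuming `client` is an `openai.OpenAI()` instance created by the caller (the dated model name is an assumption):

```python
def start_finetune(client, path, model="gpt-4o-mini-2024-07-18"):
    """Upload a JSONL dataset, then create a fine-tuning job on it."""
    with open(path, "rb") as f:
        upload = client.files.create(file=f, purpose="fine-tune")
    job = client.fine_tuning.jobs.create(training_file=upload.id, model=model)
    return job.id
```

Jobs run asynchronously; the returned id is what you poll (via `client.fine_tuning.jobs.retrieve`) until the job finishes.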
**Start a job (OpenAI CLI)**

```shell
openai fine_tuning.jobs.create -t file-X -m gpt-4o-mini
```

**OpenAI JSONL format** (one JSON object per line; shown expanded for readability)

```json
{"messages": [
  {"role": "system", "content": "You are a customer support bot for Enterno."},
  {"role": "user", "content": "Where is my invoice?"},
  {"role": "assistant", "content": "You can find invoices at /dashboard → Billing → History."}
]}
```

**QLoRA locally (Unsloth)**

```python
from trl import SFTTrainer
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained("unsloth/llama-3-8b-bnb-4bit")
model = FastLanguageModel.get_peft_model(
    model, r=16, target_modules=["q_proj", "k_proj", "v_proj"]
)
trainer = SFTTrainer(model=model, train_dataset=ds, max_seq_length=2048)  # ds: prepared Dataset
trainer.train()
```

**Together.ai CLI**

```shell
$ together files upload train.jsonl
$ together fine-tuning create \
    --training-file FILE_ID \
    --model meta-llama/Meta-Llama-3.1-70B-Instruct-Reference \
    --lora --lora-r 16 --lora-alpha 32
```

**Inference after FT (OpenAI)**

```python
resp = client.chat.completions.create(
    model="ft:gpt-4o-mini-2024:myorg::abc",
    messages=[...],
)
```

**Eval with Ragas**

```python
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

results = evaluate(dataset, metrics=[answer_relevancy, faithfulness])
```
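Before uploading, it helps to validate the dataset against the chat format above. A minimal sketch with no external dependencies (it checks structure only, not content quality):

```python
import json

def validate_jsonl(path):
    """Return the number of examples; raise on the first malformed one."""
    n = 0
    with open(path) as f:
        for i, line in enumerate(f, 1):
            msgs = json.loads(line).get("messages")
            assert isinstance(msgs, list) and msgs, f"line {i}: missing messages"
            for m in msgs:
                assert m.get("role") in {"system", "user", "assistant"}, f"line {i}: bad role"
                assert isinstance(m.get("content"), str), f"line {i}: non-string content"
            assert msgs[-1]["role"] == "assistant", f"line {i}: must end with an assistant turn"
            n += 1
    return n
```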
**RAG vs fine-tuning?** RAG for dynamic knowledge that is easy to update; FT for style, tone, and format consistency. Often combined: FT for tone + RAG for facts.
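The combined pattern can be sketched like this: retrieved chunks supply the facts in the prompt, while the fine-tuned model supplies the tone (`retrieve()` and the `ft:` model id below are placeholders):

```python
def build_messages(question, chunks):
    """Inject retrieved facts into the system prompt; the FT model handles tone."""
    context = "\n".join(chunks)
    return [
        {"role": "system", "content": "Answer using only this context:\n" + context},
        {"role": "user", "content": question},
    ]

# resp = client.chat.completions.create(
#     model="ft:gpt-4o-mini-2024:myorg::abc",   # tone from fine-tuning
#     messages=build_messages(q, retrieve(q)),  # facts from RAG
# )
```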
**What does it cost?** OpenAI gpt-4o-mini FT: $3/1M training tokens. Together Llama 3 70B LoRA: ~$5-20 per run. Self-host: $0 if you already have a GPU.
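A back-of-envelope check of the OpenAI number. Billed training tokens scale with epochs; the 3-epoch default here is an assumption:

```python
def openai_ft_cost(dataset_tokens, epochs=3, usd_per_million=3.0):
    """Cost = dataset tokens x epochs x price per 1M training tokens."""
    return dataset_tokens * epochs * usd_per_million / 1_000_000

openai_ft_cost(1_000 * 500)  # 1k examples x ~500 tokens each, 3 epochs -> $4.50
```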
**How do you evaluate?** On a held-out test set (~20% of examples). Metrics depend on the task: exact match, BLEU, or LLM-as-judge (GPT-4 grades the outputs).
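Exact match is the simplest of those metrics; a minimal sketch over a held-out set:

```python
def exact_match(preds, refs):
    """Fraction of predictions equal to the reference, whitespace-normalized."""
    hits = sum(" ".join(p.split()) == " ".join(r.split()) for p, r in zip(preds, refs))
    return hits / len(refs)

exact_match(["Paris", "42 "], ["Paris", "41"])  # -> 0.5
```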
**LoRA or full fine-tuning?** LoRA updates 0.1-1% of parameters: fast and cheap. Full FT updates all parameters: best quality, but 10-100x the cost. For 95% of use cases, LoRA is enough.
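The 0.1-1% figure is easy to reproduce. A rough count, assuming square projection matrices (it ignores grouped-query attention, where the k/v projections are smaller):

```python
def lora_trainable(hidden, layers, modules_per_layer, r):
    """Each adapted d x d weight gains two low-rank factors: d x r and r x d."""
    return layers * modules_per_layer * 2 * hidden * r

p = lora_trainable(4096, 32, 3, 16)  # q/k/v at r=16, Llama-3-8B-like shape
# p is ~12.6M trainable params, i.e. roughly 0.16% of an 8B model
```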