Serverless GPU in 2026 made LLM hosting affordable: (1) Modal.com ($0.0005/s for A10G) — Python-native, 2-5s cold start; (2) RunPod Serverless ($0.0003/s) — cheaper, template-based; (3) Replicate ($0.001/s) — pre-built models ready to call; (4) Cloudflare Workers AI — edge inference, limited model catalog. The alternative is self-hosting on bare-metal GPUs ($4-10/h). Pay-per-request wins for variable traffic, bare-metal for sustained load.
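The per-second prices above make the serverless-vs-bare-metal tradeoff easy to quantify: per-second billing wins while the GPU sits idle most of the hour. A minimal sketch of the breakeven math (the $1/h comparison figure in the comment is an illustrative assumption, not a quoted price):

```python
def serverless_cost_per_hour(price_per_s: float, utilization: float) -> float:
    """Hourly cost when you only pay for busy seconds (utilization in 0..1)."""
    return price_per_s * 3600 * utilization

def breakeven_utilization(price_per_s: float, baremetal_per_h: float) -> float:
    """Utilization above which renting bare-metal becomes cheaper."""
    return baremetal_per_h / (price_per_s * 3600)

# Modal A10G at $0.0005/s costs $1.80/h even at 100% utilization,
# so it beats a $4/h bare-metal box outright; against a hypothetical
# $1/h card the crossover sits at ~55% utilization.
```

Note the comparison is only meaningful between comparable GPU classes; the $4-10/h bare-metal range typically buys a larger card than an A10G.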
Below: step-by-step instructions, working examples, common mistakes, and an FAQ.
**Modal Python (Llama 3 70B)**

```python
import modal

app = modal.App('llama-inference')
image = modal.Image.debian_slim().pip_install('vllm')

@app.function(gpu='A100-80GB', image=image, timeout=600)
def generate(prompt: str) -> str:
    # Model loads inside the function, i.e. on every cold start;
    # for production, cache it in a class with @modal.enter().
    from vllm import LLM
    llm = LLM('meta-llama/Llama-3-70B-Instruct')
    return llm.generate(prompt)[0].outputs[0].text

@app.local_entrypoint()
def main():
    result = generate.remote('Hello')
    print(result)
```

**RunPod Serverless**

```python
# handler.py — deployed via the RunPod UI with a Dockerfile
# based on a vllm or llama.cpp image; `llm` is initialized at
# container startup, outside the handler.
import runpod

def handler(event):
    prompt = event['input']['prompt']
    output = llm.generate(prompt)[0].outputs[0].text
    return {'text': output}

runpod.serverless.start({'handler': handler})
```

**Replicate (pre-built models)**

```python
import replicate

output = replicate.run(
    'meta/llama-3-70b-instruct',
    input={'prompt': 'Hello', 'max_tokens': 512},
)
```

**Cloudflare Workers AI**

```javascript
export default {
  async fetch(request, env) {
    const { prompt } = await request.json();
    const resp = await env.AI.run('@cf/meta/llama-3-8b-instruct', { prompt });
    return Response.json(resp);
  }
};
```

**vLLM locally (Docker)**

```shell
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
    --model meta-llama/Llama-3-70B-Instruct \
    --max-model-len 8192
```
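The Docker command above starts a server with an OpenAI-compatible API on port 8000. A minimal client sketch using only the standard library (the running server, model name, and `max_tokens` value are assumptions carried over from the command above):

```python
import json
import urllib.request

def build_chat_request(prompt: str,
                       model: str = 'meta-llama/Llama-3-70B-Instruct') -> dict:
    """Payload in OpenAI chat-completions format, as served by vLLM."""
    return {
        'model': model,
        'messages': [{'role': 'user', 'content': prompt}],
        'max_tokens': 256,
    }

def chat(prompt: str, base_url: str = 'http://localhost:8000') -> str:
    """POST the prompt and return the assistant's reply text."""
    req = urllib.request.Request(
        f'{base_url}/v1/chat/completions',
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={'Content-Type': 'application/json'},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)['choices'][0]['message']['content']
```

Because the endpoint speaks the OpenAI protocol, any OpenAI SDK pointed at `base_url + '/v1'` works the same way.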
Keep-warm strategy: a ping every 5 minutes keeps the container alive, at roughly $0.10/hour of idle cost. Alternatively, pick a provider with faster cold starts (Modal, ~2s).
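The keep-warm ping can be a cron job or a background thread. A minimal sketch (the pinged URL and the 300s interval are assumptions — use whatever cheap no-op your handler exposes, and an interval inside your provider's scale-down window), plus a helper for what staying warm costs:

```python
import threading
import urllib.request

PING_INTERVAL_S = 300  # every 5 minutes

def keep_warm(url: str, stop: threading.Event) -> None:
    """Hit a cheap endpoint periodically so the container never scales to zero."""
    while not stop.wait(PING_INTERVAL_S):
        try:
            urllib.request.urlopen(url, timeout=10)
        except OSError:
            pass  # a failed ping just means the next real request cold-starts

def idle_cost_per_month(cost_per_hour: float,
                        hours_warm_per_day: float = 24) -> float:
    """Dollars per 30-day month spent purely on keeping the container warm."""
    return cost_per_hour * hours_warm_per_day * 30
```

At ~$0.10/hour idle, staying warm around the clock is about $72/month — worth it only when cold-start latency costs you more than that.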
**Modal or RunPod?** Modal: Python-native UX, pricier. RunPod: cheaper, but requires building a Docker image. For prototyping, Modal; for production scale, RunPod.
**Is vLLM worth it?** For production serving, yes: 2-5x the throughput of raw transformers, thanks to PagedAttention and continuous batching.
**When does Replicate make sense?** Good for low volume and pre-built models. A dedicated deployment (Modal, RunPod) becomes cheaper above roughly 1M tokens/day.
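The ~1M tokens/day threshold can be sanity-checked from the per-second prices listed at the top; a rough sketch (the ~30 tokens/s single-stream throughput is an assumption and varies widely by model, hardware, and batching):

```python
def daily_cost_per_second_billing(tokens_per_day: float,
                                  price_per_s: float,
                                  tokens_per_s: float = 30.0) -> float:
    """Daily spend when billed per busy GPU-second."""
    return tokens_per_day / tokens_per_s * price_per_s

# Replicate at $0.001/s: 1M tokens/day is ~33,000 busy seconds, ~$33/day.
# RunPod at $0.0003/s for the same load is ~$10/day; the gap
# widens linearly with volume, which is why high-volume workloads
# move off pre-built hosting.
```

This ignores batching, which shifts both numbers down but favors the dedicated deployment even more.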