🤔 What Problem Does Serverless AI Solve?
Let’s say you fine‑tune Llama 3 8B on your company’s support tickets. You want to expose it as an API. Your options:
- Option A (traditional): Rent an A100 GPU from AWS (~$3/hour). Keep it running 24/7. That’s $2,160/month, even if you get only 100 requests. You also need to set up Docker, load balancing, and autoscaling.
- Option B (serverless): Push the model to Replicate or Modal. The first request takes ~5 seconds to load the model (cold start). Subsequent requests are fast. When no requests come, it scales to zero – you pay $0. That $2,160/month becomes maybe $10/month for low usage.
Serverless AI is like AWS Lambda, but for GPU‑heavy models.
🌩️ The Main Serverless AI Platforms (2026)
| Platform | Best for | Cold start | Pricing | Special feature |
|---|---|---|---|---|
| Replicate | Quick prototypes, image generation, community models | 2‑5 seconds | $0.0005‑$0.05 per second of GPU + per‑execution fee | Huge library of pre‑built models (Stable Diffusion, Llama, Whisper) |
| Modal | Complex workflows (multi‑step, large data), batch jobs | 5‑10 seconds | $0.00026 per second (A100) + volume storage | Serverless functions that can run for hours; spot instances (70% cheaper) |
| Banana.dev | Low‑latency inference, model keep‑alive | 1‑2 seconds (with keep‑alive) | $0.00035 per second + $0.25/GB memory‑hour | Can keep models warm for 1¢/hour |
| Hugging Face Inference Endpoints | Hugging Face ecosystem | N/A (always on dedicated instances) | $0.60‑$1.50/hour (fixed) | Not truly serverless – but easy to deploy |
| Vercel AI (Edge) | LLM streaming for web apps | ~50ms (no GPU) | Free tier + $0.01 per request (for LLM APIs) | Serverless functions that call other LLM APIs, not self‑hosted |
📦 Step‑by‑Step: Deploy a Custom Model on Replicate
Replicate makes it absurdly easy. You’ll need a cog.yaml and a predict.py.
1. Install Cog (Replicate’s build tool)
sudo curl -o /usr/local/bin/cog -L "https://github.com/replicate/cog/releases/latest/download/cog_$(uname -s)_$(uname -m)"
sudo chmod +x /usr/local/bin/cog
2. Create cog.yaml
build:
  gpu: true
  python_version: "3.11"
  python_packages:
    - torch
    - transformers
    - accelerate
predict: "predict.py:Predictor"
3. Write predict.py
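from cog import BasePredictor, Input
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Base model ID on Hugging Face (gated, needs an access token).
# Swap in the ID or local path of your fine-tuned checkpoint.
MODEL_ID = "meta-llama/Meta-Llama-3-8B"

class Predictor(BasePredictor):
    def setup(self):
        # Runs once per container start (this is the cold start): load weights onto the GPU
        self.tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
        self.model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16).cuda()

    def predict(self, prompt: str = Input(description="Input prompt")) -> str:
        # The tokenizer output is moved with .to("cuda"); it has no .cuda() method
        inputs = self.tokenizer(prompt, return_tensors="pt").to("cuda")
        outputs = self.model.generate(**inputs, max_new_tokens=100)
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)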
4. Push to Replicate
cog push r8.im/yourusername/llama-3-8b
That’s it. Cog builds the Docker image and pushes it to Replicate, which provisions GPUs on demand and gives you an API endpoint. You can now call it from any language, for example Python:
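import replicate

# Requires REPLICATE_API_TOKEN in the environment.
# Depending on the client version, you may need to pin a version: "yourusername/llama-3-8b:<version-id>"
output = replicate.run("yourusername/llama-3-8b", input={"prompt": "Explain serverless AI"})
print(output)  # predict() returns a single string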
⚡ Performance: Cold Starts and Keep‑Alive
The biggest downside of serverless is the cold start. When your model hasn’t been used for a few minutes (Replicate: ~5 minutes idle; Modal: 10‑15 minutes), the platform unloads it. The next request triggers a reload, which can take 5‑30 seconds depending on model size.
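Because cold-start time depends on model size and platform, it is worth measuring rather than guessing. A minimal sketch, assuming you can wrap your endpoint call in a zero-argument function: time a few back-to-back calls and compare the first with the rest. `measure_latencies` and `cold_start_overhead` are hypothetical helper names, not part of any platform SDK.

```python
import statistics
import time
from typing import Callable, List


def measure_latencies(call: Callable[[], object], n: int = 3,
                      clock: Callable[[], float] = time.perf_counter) -> List[float]:
    """Invoke `call` n times back to back; return the seconds taken by each call."""
    latencies = []
    for _ in range(n):
        start = clock()
        call()
        latencies.append(clock() - start)
    return latencies


def cold_start_overhead(latencies: List[float]) -> float:
    """Rough estimate: first (possibly cold) call minus the median of the later calls."""
    if len(latencies) < 2:
        raise ValueError("need at least two measurements")
    return latencies[0] - statistics.median(latencies[1:])
```

Run it after the endpoint has been idle long enough to be unloaded; the estimate only means something if the first call was actually cold.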
Mitigations:
- Keep‑alive: Banana.dev and Modal allow you to keep one “warm” instance for a small hourly fee (e.g., $0.01‑$0.05/hour). This eliminates cold starts for that instance.
- Background pings: Set up a cron job that calls your model every 4 minutes to keep it alive (a slight abuse of the terms of service, but it works).
- Design for async: For batch processing, cold starts don’t matter – just queue requests and accept the first one being slow.
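The background-ping idea can be sketched as a small loop. This is an illustration under assumptions, not a platform feature: `keep_warm` is a hypothetical helper, `ping` stands for any function that sends a cheap request to your model, and the 240-second default mirrors the "every 4 minutes" suggestion above.

```python
import time
from typing import Callable, Optional


def keep_warm(ping: Callable[[], None], interval_s: float = 240.0,
              max_pings: Optional[int] = None,
              sleep: Callable[[float], None] = time.sleep) -> int:
    """Call `ping` every `interval_s` seconds; return how many pings were attempted.

    With max_pings=None the loop runs until the process is stopped.
    """
    attempted = 0
    while max_pings is None or attempted < max_pings:
        if attempted > 0:
            sleep(interval_s)  # wait between pings, not before the first one
        try:
            ping()
        except Exception as exc:  # a failed ping should not stop the loop
            print(f"keep-warm ping failed: {exc}")
        attempted += 1
    return attempted
```

Keep in mind that each ping is billed like any other request, so compare the cost against a paid keep-alive option before relying on it.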
💰 Cost Comparison: Serverless vs. Dedicated GPU
Scenario A: 10,000 requests/month, each taking 2 seconds of GPU time.
- Serverless (Replicate): 10,000 * 2s * $0.0005/s = $10 of GPU time, plus 10,000 * $0.0001 per request = $1 in fees, so ~$11/month.
- Dedicated GPU (AWS g4dn.xlarge ~$0.50/hour): 24/7 = $360/month.
Serverless wins hands down for low to medium volume.
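Scenario B: 10 million requests/month, each 2 seconds. That’s 20M seconds of GPU time.
- Serverless: 20M * $0.0005 = $10,000/month, plus ~$1,000 in per‑request fees.
- Dedicated (one GPU offers ~2.6M seconds per month, so you need ~10 GPUs at 80% utilization): 10 * $360 = $3,600/month.
Dedicated GPUs become cheaper once steady traffic keeps them reasonably busy. At these prices, one $360/month GPU breaks even at roughly 330,000 requests/month, or about 25% utilization.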
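To rerun this comparison with your own numbers, the arithmetic can be sketched as a few functions. The defaults are the illustrative prices used in the scenarios above ($0.0005 per GPU second, $0.0001 per request, $0.50/hour, 80% utilization); the function names are hypothetical, and the model assumes one request at a time per GPU with no batching.

```python
import math

HOURS_PER_MONTH = 24 * 30                    # 720 hours, matching the $360/month figure
SECONDS_PER_MONTH = HOURS_PER_MONTH * 3600   # 2,592,000 seconds


def serverless_cost(requests: int, seconds_per_request: float = 2.0,
                    price_per_second: float = 0.0005,
                    fee_per_request: float = 0.0001) -> float:
    """Monthly serverless bill: billed GPU seconds plus the per-request fee."""
    return requests * (seconds_per_request * price_per_second + fee_per_request)


def gpus_needed(requests: int, seconds_per_request: float = 2.0,
                utilization: float = 0.8) -> int:
    """Always-on GPUs required to absorb the load at the target utilization."""
    busy_seconds = requests * seconds_per_request
    return max(1, math.ceil(busy_seconds / (SECONDS_PER_MONTH * utilization)))


def dedicated_cost(requests: int, seconds_per_request: float = 2.0,
                   hourly_rate: float = 0.50, utilization: float = 0.8) -> float:
    """Monthly bill for GPUs that run 24/7 regardless of traffic."""
    return gpus_needed(requests, seconds_per_request, utilization) * hourly_rate * HOURS_PER_MONTH


def break_even_requests(seconds_per_request: float = 2.0,
                        price_per_second: float = 0.0005,
                        fee_per_request: float = 0.0001,
                        hourly_rate: float = 0.50) -> float:
    """Monthly request count at which serverless costs as much as one dedicated GPU."""
    per_request = seconds_per_request * price_per_second + fee_per_request
    return hourly_rate * HOURS_PER_MONTH / per_request
```

With the default figures, `break_even_requests()` comes out at roughly 327,000 requests per month. Below that, a single always-on GPU is mostly paying for idle time.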
🛠️ When to Use Serverless AI (And When Not To)
✅ Perfect for:
- Prototyping and MVPs – You don’t know your traffic yet.
- Batch inference – Process 10,000 images once a week. Cold start doesn’t matter.
- Sporadic APIs – A Slack bot that gets 100 requests a day.
- Developers without DevOps skills – No Kubernetes required.
❌ Avoid for:
- High‑throughput production (millions of requests/day) – Dedicated GPUs are cheaper.
- Latency‑sensitive apps (<100ms) – Cold starts will kill you. Use keep‑alive or dedicated.
- Models > 50GB – Some serverless platforms have size limits. Check Modal (supports up to 200GB volumes).
Start with Replicate or Modal for your MVP. As your traffic grows, add a dedicated GPU endpoint (e.g., using RunPod or AWS) for the most frequent requests, but keep serverless for the long tail. A simple router is enough to begin with: if the request count is below 1,000/day, send traffic to serverless; otherwise, send it to dedicated. Tune the threshold to your own prices so that costs stay close to optimal.
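One way to sketch such a router is to count requests over a trailing 24-hour window. `VolumeRouter` is a hypothetical class, the 1,000/day threshold is the figure from the paragraph above, and the returned labels would be mapped to your actual endpoints.

```python
import time
from collections import deque
from typing import Callable, Deque


class VolumeRouter:
    """Pick 'serverless' until the trailing-window request count reaches a threshold."""

    def __init__(self, threshold: int = 1000, window_s: float = 86_400.0,
                 clock: Callable[[], float] = time.time) -> None:
        self.threshold = threshold
        self.window_s = window_s
        self.clock = clock
        self._seen: Deque[float] = deque()  # timestamps of recent requests

    def route(self) -> str:
        """Record one request and return the backend it should go to."""
        now = self.clock()
        # Drop requests that have aged out of the window
        while self._seen and self._seen[0] <= now - self.window_s:
            self._seen.popleft()
        target = "serverless" if len(self._seen) < self.threshold else "dedicated"
        self._seen.append(now)
        return target
```

This version keeps its counter in memory, so it only works for a single process. With several workers, the count would need to live in shared storage.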
🧪 Beyond Replicate: Modal for Complex Workflows
Modal is more powerful than Replicate – it’s a full serverless Python environment with GPU support. You can write multi‑step pipelines (download data → process → model inference → upload results) as a single Python script. Example:
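import modal

app = modal.App("my-rag-pipeline")  # modal.Stub was renamed to modal.App

@app.function(gpu="A100")
def embed_and_store(documents):
    # Your custom logic (runs remotely on an A100)
    pass

@app.local_entrypoint()
def main():
    my_docs = ["first document", "second document"]  # placeholder input
    embed_and_store.remote(my_docs)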
Modal handles all the infrastructure. It’s like AWS Lambda but for AI.
Final verdict: Serverless AI is a game‑changer for most AI projects. Only high‑volume or latency‑critical applications need dedicated GPUs. Start serverless, and move only once you have proven scale.
