Serverless AI: Deploying Models Without Managing Infrastructure

Serverless AI – Deep Guide to Deploying Models Without GPUs
Deploy a model in 10 lines of code. No Kubernetes. No GPU provisioning. Here’s how – with real examples and cost comparisons.
📖 Plain‑English summary: Traditionally, deploying an AI model meant renting a GPU server, setting up a web server, scaling it, and monitoring it – even if you only got 10 requests a day. Serverless AI platforms flip that: you upload your model (or pick one from a library), and they give you an API that automatically scales to zero when not in use. You pay only for the milliseconds your model is actually running. This is perfect for prototypes, batch jobs, or any traffic that isn’t 24/7. This post explains the major platforms, costs, and when to use (or not use) serverless.

🤔 What Problem Does Serverless AI Solve?

Let’s say you fine‑tune Llama 3 8B on your company’s support tickets. You want to expose it as an API. Your options:

  • Option A (traditional): Rent an A100 GPU from AWS (~$3/hour). Keep it running 24/7. That’s $2,160/month, even if you get only 100 requests. You also need to set up Docker, load balancing, and autoscaling.
  • Option B (serverless): Push the model to Replicate or Modal. The first request takes ~5 seconds to load the model (cold start). Subsequent requests are fast. When no requests come, it scales to zero – you pay $0. That $2,160/month becomes maybe $10/month for low usage.

Serverless AI is like AWS Lambda, but for GPU‑heavy models.

🌩️ The Main Serverless AI Platforms (2026)

Platform | Best for | Cold start | Pricing | Special feature
--- | --- | --- | --- | ---
Replicate | Quick prototypes, image generation, community models | 2‑5 seconds | $0.0005‑$0.05 per second of GPU + per‑execution fee | Huge library of pre‑built models (Stable Diffusion, Llama, Whisper)
Modal | Complex workflows (multi‑step, large data), batch jobs | 5‑10 seconds | $0.00026 per second (A100) + volume storage | Serverless functions that can run for hours; spot instances (70% cheaper)
Banana.dev | Low‑latency inference, model keep‑alive | 1‑2 seconds (with keep‑alive) | $0.00035 per second + $0.25/GB memory‑hour | Can keep models warm for 1¢/hour
Hugging Face Inference Endpoints | Hugging Face ecosystem | N/A (always on dedicated instances) | $0.60‑$1.50/hour (fixed) | Not truly serverless – but easy to deploy
Vercel AI (Edge) | LLM streaming for web apps | ~50ms (no GPU) | Free tier + $0.01 per request (for LLM APIs) | Serverless functions that call other LLM APIs, not self‑hosted
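
To make the per‑second prices in the table concrete, here is a minimal sketch that turns them into a cost per request. The rates are copied from the table (Replicate at its low end). The 2‑second request duration and the function name `cost_per_request` are assumptions for illustration, and the sketch ignores storage, memory, and per‑execution fees.

```python
# Per-second GPU rates taken from the table above (USD per second).
RATES_PER_SECOND = {
    "Replicate (low end)": 0.0005,
    "Modal (A100)": 0.00026,
    "Banana.dev": 0.00035,
}

def cost_per_request(rate_per_second: float, seconds: float) -> float:
    """GPU cost of one request that runs for `seconds` at a per-second rate."""
    return rate_per_second * seconds

if __name__ == "__main__":
    for platform, rate in RATES_PER_SECOND.items():
        # Assumed duration: 2 seconds of GPU time per request.
        print(f"{platform}: ${cost_per_request(rate, 2.0):.5f} per request")
```

Swap in your own measured inference time; the ranking between platforms can change once fixed fees are included.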

📦 Step‑by‑Step: Deploy a Custom Model on Replicate

Replicate makes it absurdly easy. You’ll need a cog.yaml and a predict.py.

1. Install Cog (Replicate’s build tool)

sudo curl -o /usr/local/bin/cog -L "https://github.com/replicate/cog/releases/latest/download/cog_$(uname -s)_$(uname -m)"
sudo chmod +x /usr/local/bin/cog

2. Create cog.yaml

build:
  gpu: true
  python_version: "3.11"
  python_packages:
    - torch
    - transformers
    - accelerate
predict: "predict.py:Predictor"

3. Write predict.py

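from cog import BasePredictor, Input
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hugging Face repo ID for the base Llama 3 8B weights (gated: requires accepting Meta's license)
MODEL_ID = "meta-llama/Meta-Llama-3-8B"

class Predictor(BasePredictor):
    def setup(self):
        # Runs once per container start (the cold start): load the weights onto the GPU in half precision
        self.model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16).cuda()
        self.tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

    def predict(self, prompt: str = Input(description="Input prompt")) -> str:
        # The tokenizer returns a BatchEncoding, which has .to() but no .cuda()
        inputs = self.tokenizer(prompt, return_tensors="pt").to("cuda")
        outputs = self.model.generate(**inputs, max_new_tokens=100)
        # Decode the generated sequence and drop special tokens
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)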

4. Push to Replicate

cog push r8.im/yourusername/llama-3-8b

That’s it. Cog builds the Docker image and pushes it; Replicate provisions GPUs on demand and gives you an API endpoint. You can call it over HTTP from any language, for example with the official Python client:

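import replicate  # reads the API token from the REPLICATE_API_TOKEN environment variable

# Your own models may need an explicit version suffix: "yourusername/llama-3-8b:<version-id>"
output = replicate.run("yourusername/llama-3-8b", input={"prompt": "Explain serverless AI"})
print(output)  # predict() returns a single str, so no join is needed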
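
Because the endpoint is plain HTTP, clients in other languages can call it without the `replicate` package. The sketch below only builds the request, so the pieces are visible without touching the network. It assumes Replicate's `POST /v1/predictions` endpoint with a Bearer token and a `version` field in the JSON body; the helper name `build_prediction_request` and the placeholder token and version ID are hypothetical, so check the current API reference before relying on it.

```python
import json
import urllib.request

API_URL = "https://api.replicate.com/v1/predictions"  # assumed endpoint

def build_prediction_request(token: str, version: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST request that starts a prediction."""
    body = json.dumps({"version": version, "input": {"prompt": prompt}}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )

if __name__ == "__main__":
    req = build_prediction_request("<your-api-token>", "<model-version-id>", "Explain serverless AI")
    print(req.get_method(), req.full_url)
    # To send it: urllib.request.urlopen(req), then read the JSON response.
```

The same three ingredients (URL, auth header, JSON body) carry over directly to curl, JavaScript, Go, or any other HTTP client.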

⚡ Performance: Cold Starts and Keep‑Alive

The biggest downside of serverless is the cold start. When your model hasn’t been used for a few minutes (Replicate: ~5 minutes idle; Modal: 10‑15 minutes), the platform unloads it. The next request triggers a reload, which can take 5‑30 seconds depending on model size.

Mitigations:

  • Keep‑alive: Banana.dev and Modal allow you to keep one “warm” instance for a small hourly fee (e.g., $0.01‑$0.05/hour). This eliminates cold starts for that instance.
  • Background pings: Set up a cron job that calls your model every 4 minutes to keep it alive (a slight abuse of the platform's terms, but it works). Note that each ping is billed like a normal request.
  • Design for async: For batch processing, cold starts don’t matter – just queue requests and accept the first one being slow.
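
Background pings are not free on per‑second billing, so it is worth estimating their cost before setting up the cron job. A rough sketch, assuming each ping is billed for 2 seconds at $0.0005/second (the same figures used in the cost comparison below) and a 30‑day month; `pings_per_month` and `monthly_ping_cost` are hypothetical helpers.

```python
def pings_per_month(interval_minutes: float, days: int = 30) -> int:
    """Number of keep-alive pings sent in a month at a fixed interval."""
    return int(days * 24 * 60 / interval_minutes)

def monthly_ping_cost(interval_minutes: float,
                      seconds_per_ping: float = 2.0,
                      rate_per_second: float = 0.0005) -> float:
    """Estimated monthly bill for keep-alive pings alone (assumed billing model)."""
    return pings_per_month(interval_minutes) * seconds_per_ping * rate_per_second

if __name__ == "__main__":
    n = pings_per_month(4)
    print(f"{n} pings/month, about ${monthly_ping_cost(4):.2f}")
```

Under these assumptions a 4‑minute ping costs about as much as the low‑volume scenario's real traffic, so a paid keep‑alive option may be the cheaper route where a platform offers one.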

💰 Cost Comparison: Serverless vs. Dedicated GPU

Scenario A: 10,000 requests/month, each taking 2 seconds of GPU time.

  • Serverless (Replicate): 10,000 * 2s * $0.0005/s = $10 of GPU time, plus 10,000 * $0.0001 = $1 in per‑request fees, for a total of ~$11/month.
  • Dedicated GPU (AWS g4dn.xlarge ~$0.50/hour): 24/7 = $360/month.

Serverless wins hands down for low to medium volume.
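
The Scenario A arithmetic can be reproduced with a small helper, which makes it easy to plug in your own traffic. This is a sketch using the post's figures; the function names are illustrative, and a 30‑day month (720 hours) is assumed.

```python
def serverless_monthly_cost(requests: int, seconds_per_request: float,
                            rate_per_second: float, fee_per_request: float = 0.0) -> float:
    """Pay-per-use cost: billed GPU seconds plus any flat per-request fee."""
    gpu_cost = requests * seconds_per_request * rate_per_second
    return gpu_cost + requests * fee_per_request

def dedicated_monthly_cost(hourly_rate: float, gpus: int = 1, hours: int = 720) -> float:
    """Always-on cost: every GPU is billed for every hour, used or not."""
    return hourly_rate * hours * gpus

if __name__ == "__main__":
    # Scenario A: 10,000 requests/month, 2 s each, $0.0005/s plus $0.0001 per request.
    print(f"serverless: ${serverless_monthly_cost(10_000, 2, 0.0005, 0.0001):.2f}")
    print(f"dedicated:  ${dedicated_monthly_cost(0.50):.2f}")
```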

Scenario B: 10 million requests/month, each 2 seconds. That’s 20M seconds of GPU time.

  • Serverless: 20M * $0.0005 = $10,000/month.
  • Dedicated (20M GPU‑seconds / ~2.6M seconds per GPU per month / 80% utilization ≈ 10 GPUs): 10 * $360 = $3,600/month.

Dedicated GPUs become cheaper at very high volume (millions of requests per day).
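
Where the crossover sits can be estimated by asking how busy one dedicated GPU must be before it beats per‑second billing. A back‑of‑the‑envelope sketch using the post's figures ($360/month per GPU, $0.0005/second serverless), ignoring per‑request fees and assuming a 30‑day month; the function names are illustrative.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # 2,592,000 seconds in a 30-day month

def break_even_gpu_seconds(dedicated_monthly: float, rate_per_second: float) -> float:
    """GPU seconds per month at which serverless spend equals one dedicated GPU."""
    return dedicated_monthly / rate_per_second

def break_even_utilization(dedicated_monthly: float, rate_per_second: float) -> float:
    """Fraction of the month a dedicated GPU must be busy to match serverless cost."""
    return break_even_gpu_seconds(dedicated_monthly, rate_per_second) / SECONDS_PER_MONTH

if __name__ == "__main__":
    seconds = break_even_gpu_seconds(360.0, 0.0005)
    print(f"break-even: {seconds:,.0f} GPU-seconds/month "
          f"({break_even_utilization(360.0, 0.0005):.0%} utilization)")
```

With these numbers the break‑even is about 720,000 GPU‑seconds per month, roughly 28% utilization of one GPU, or about 360,000 two‑second requests. Below that, serverless is cheaper; above it, a dedicated GPU starts to pay off.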

🛠️ When to Use Serverless AI (And When Not To)

✅ Perfect for:

  • Prototyping and MVPs – You don’t know your traffic yet.
  • Batch inference – Process 10,000 images once a week. Cold start doesn’t matter.
  • Sporadic APIs – A Slack bot that gets 100 requests a day.
  • Developers without DevOps skills – No Kubernetes required.

❌ Avoid for:

  • High‑throughput production (millions of requests/day) – Dedicated GPUs are cheaper.
  • Latency‑sensitive apps (<100ms) – Cold starts will kill you. Use keep‑alive or dedicated.
  • Models > 50GB – Some serverless platforms have size limits. Check Modal (supports up to 200GB volumes).

🚀 Pro tip: Hybrid deployment for growth

Start with Replicate or Modal for your MVP. As your traffic grows, add a dedicated GPU endpoint (e.g., using RunPod or AWS) for your busiest models, but keep serverless for the long tail. Use a simple router: if a model gets fewer than 1,000 requests/day → serverless; else → dedicated. This keeps your costs close to optimal as traffic grows.
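
The routing rule can be sketched in a few lines. This is a minimal illustration, not a production router: the 1,000 requests/day threshold comes from the tip above, while the `Backend` enum, `choose_backend`, and the idea of counting requests over the last 24 hours are assumptions.

```python
from enum import Enum

class Backend(Enum):
    SERVERLESS = "serverless"
    DEDICATED = "dedicated"

DAILY_THRESHOLD = 1000  # requests/day, from the rule of thumb above

def choose_backend(requests_last_24h: int, threshold: int = DAILY_THRESHOLD) -> Backend:
    """Send low-traffic workloads to serverless and busy ones to the dedicated endpoint."""
    return Backend.SERVERLESS if requests_last_24h < threshold else Backend.DEDICATED

if __name__ == "__main__":
    for count in (100, 5000):
        print(count, "->", choose_backend(count).value)
```

In practice the threshold would be tuned from your own break‑even point rather than fixed at 1,000.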

🧪 Beyond Replicate: Modal for Complex Workflows

Modal is more powerful than Replicate – it’s a full serverless Python environment with GPU support. You can write multi‑step pipelines (download data → process → model inference → upload results) as a single Python script. Example:

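import modal

# modal.App replaces the older modal.Stub name used in early Modal releases
app = modal.App("my-rag-pipeline")

@app.function(gpu="A100")
def embed_and_store(documents):
    # Your custom logic (runs remotely on an A100)
    pass

@app.local_entrypoint()
def main():
    my_docs = ["first document", "second document"]  # placeholder input
    embed_and_store.remote(my_docs)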

Modal handles all the infrastructure. It’s like AWS Lambda but for AI.

Final verdict: Serverless AI is a game‑changer for most AI projects. As a rough rule of thumb, only the top 10% or so, the high‑volume, low‑latency applications, need dedicated GPUs. Start serverless, and move only once you have proven scale.

Author: Jon-Paul Walton