AI Agent Deployment: Cloud Platforms Compared for Ephemeral, Long-Running, and GPU Workloads (2026)
The AI agent deployment landscape has fractured into three distinct patterns in 2026, each with wildly different cost, latency, and durability tradeoffs. A Fly.io agent sleeping between requests costs $0.15/month. A Modal sandbox with an H100 GPU costs $3.95/hour. An always-on Hetzner VM costs $4/month. Choosing wrong means either burning money or building on infrastructure that can't handle your agent's actual workload.
This guide breaks down 20+ platforms across ephemeral, long-running, and GPU-accelerated deployments — with real pricing, architecture patterns, and how the top AI companies (Devin, Codex, Cursor, Anthropic) actually deploy their agents in production. For context on how AI agents use retrieval and reasoning, see my RAG techniques comparison guide.
Table of Contents
- The Three Deployment Patterns
- Ephemeral and Serverless Platforms
- Long-Running and Persistent Platforms
- GPU Cloud Pricing Compared
- Workflow Orchestration and Durability
- How Top AI Companies Deploy Agents
- Sandboxing and Security
- Cost Optimization Strategies
- Decision Framework
- Quick Reference
- Conclusion
The Three Deployment Patterns
Every AI agent deployment in 2026 falls into one of three patterns:
| Pattern | How It Works | Latency | Cost Profile | Best For |
|---|---|---|---|---|
| Ephemeral Sandbox | Spin up, execute task, destroy | Cold start + task time | Pay-per-second | Coding agents, untrusted code, one-shot tasks |
| Durable Workflow | Long-running with checkpointing | Seconds to hours | Per-step or per-hour | Multi-step agents, human-in-the-loop |
| Persistent Service | Always on or suspend/resume | Instant (warm) | Monthly or per-second | Chat agents, monitoring, always-available |
The right pattern depends on your agent's lifecycle. A coding agent that processes GitHub issues needs an ephemeral sandbox — spin up a container, clone the repo, execute, output a PR, destroy. A research agent that runs for hours across multiple data sources needs durable workflows with checkpointing. A customer support bot needs to be always available but is idle 95% of the time — suspend/resume is the sweet spot.
Ephemeral and Serverless Platforms
These platforms spin up compute on demand. You pay only for what you use, but cold starts and timeout limits matter.
| Platform | Max Timeout | GPU | GPU Cost/hr | Cold Start | Best For |
|---|---|---|---|---|---|
| Modal | 24 hours | H100, A100, T4, L4 | $0.59-3.95 | Less than 1 sec | AI sandboxes, GPU inference |
| Cloud Run | 60 min | L4, RTX PRO 6000 | ~$0.67 (L4) | Less than 5 sec (GPU) | Production containerized agents |
| GitHub Actions | 6 hours | T4 (16GB) | $3.12-6.12 | 15-45 sec | Coding agents, repo automation |
| AWS Lambda | 15 min | None | N/A | 100ms-10s | Short API orchestration |
| Cloudflare Workers | 5 min CPU | Workers AI | ~$0.011/1K neurons | ~1ms | Edge routing, lightweight agents |
| Vercel Functions | 5 min (Pro) | None | N/A | 250ms-1s | Frontend API layer only |
| Blacksmith | 6 hours | None | N/A | Faster than GitHub-hosted | Cheaper GitHub Actions |
| Azure Container Apps | Configurable | A100, T4 | ~$2.88-3.96 | Variable | Enterprise Azure workloads |
Modal — The AI Agent Platform Leader
Modal dominates AI agent sandboxed execution in 2026 ($87M Series B, $1.1B valuation). Sandboxes run on gVisor, autoscale to 50,000+ concurrent sessions, and start in under 1 second — with GPU access inside sandboxes.
The tradeoffs: the SDK is Python-only (TypeScript agents need a wrapper), and sandboxes bill at non-preemptible rates, roughly 3.75x the advertised base rates. At 10K CPU-only runs/month (5-min average), expect ~$50-100. With T4 GPUs, that jumps to ~$500.
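The back-of-envelope math behind those estimates is simple pay-per-second arithmetic. A minimal sketch — the CPU rate below is an illustrative assumption, while the T4 rate is derived from the $0.59/hr figure in the table above:

```python
# Pay-per-second sandbox cost estimate.
# RATE values: CPU rate is an illustrative assumption, not Modal's published price;
# the T4 rate is $0.59/hr from the comparison table, converted to per-second.

def monthly_sandbox_cost(runs_per_month: int, avg_run_seconds: float,
                         rate_per_second: float) -> float:
    """Estimate monthly spend for pay-per-second sandbox runs."""
    return runs_per_month * avg_run_seconds * rate_per_second

# 10K CPU-only runs at 5 minutes each, at an assumed effective rate:
cpu_cost = monthly_sandbox_cost(10_000, 300, 0.000025)        # ~$75/month
# Same workload with a T4 attached ($0.59/hr -> per-second):
gpu_cost = monthly_sandbox_cost(10_000, 300, 0.59 / 3600)     # ~$492/month
```

The point of writing it out: duration dominates. Halving average run time halves the bill, which is why Blacksmith-style setup caching matters as much as the per-second rate.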
GitHub Actions — The Default Coding Agent Runtime
AI agent PRs on GitHub surged from 4M (Sep 2025) to 17M+ (Mar 2026). GitHub's new Agentic Workflows let you define agent workflows in plain Markdown with native repo access via the GitHub MCP Server. The 6-hour timeout per job is generous for most coding tasks.
Blacksmith is a drop-in replacement with 2x faster hardware (gaming-grade CPUs) at 50-67% lower cost and Docker layer caching. For coding agents with heavy dependency installs, Blacksmith cuts environment setup from minutes to seconds.
AWS Lambda — Limited but Cheap
The 15-minute hard timeout and zero GPU support make Lambda unsuitable for complex agents. It works for short API orchestration — an agent that receives a webhook, makes a few LLM API calls, and stores the result. SnapStart (2026) brings cold starts down to 90-140ms for Python. ARM64/Graviton2 cuts costs 20%.
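The webhook-orchestration shape fits Lambda's handler model directly. A minimal sketch — `call_llm` and `store_result` are hypothetical stand-ins for a real LLM client and a datastore like S3 or DynamoDB:

```python
import json

def call_llm(prompt: str) -> str:
    # Stand-in for a real LLM API call made over HTTP.
    return f"summary of: {prompt}"

RESULTS = {}  # stand-in for S3/DynamoDB

def store_result(key: str, value: str) -> None:
    RESULTS[key] = value

def handler(event, context=None):
    """Webhook in, one LLM call, result stored -- well under the 15-min cap."""
    body = json.loads(event["body"])
    summary = call_llm(body["text"])
    store_result(body["id"], summary)
    return {"statusCode": 200, "body": json.dumps({"id": body["id"]})}
```

Anything that loops on tool calls or waits on humans outgrows this shape quickly; that's what the durable workflow section below is for.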
Cloudflare Workers — Cheapest at Scale
At $5/month for 10M requests, Workers is unbeatable for lightweight agent routing. The 128MB memory limit is severely restrictive for AI workloads, but Durable Objects provide uniquely powerful stateful edge compute for agent state. Best used as the routing and orchestration layer that dispatches heavy work elsewhere.
Long-Running and Persistent Platforms
For agents that need to stay alive — either always-on or waking on demand.
CPU-Only Persistent Agent Comparison (2 vCPU, 4GB RAM, 24/7)
| Platform | Monthly Cost | Idle Cost | Auto-Stop | Complexity |
|---|---|---|---|---|
| Hetzner CX23 | $4.35 | N/A (always on) | Manual | Low |
| Fly.io | ~$6.64 | ~$0.15 (suspended) | Suspend/Resume | Low |
| Render Starter | $7-25 | N/A (always on) | None | Low |
| DigitalOcean | $12 | Billed (powered off) | Manual | Low |
| Railway | ~$20 | Sleep available | Dev only | Low |
| GCP e2-medium | ~$25 | $0 (suspended) | Via Scheduler | Medium |
| Azure ACI | ~$32 | $0 (stopped) | Via Functions | Medium |
| AWS Fargate | ~$36 | $0 (stopped) | Via Lambda | High |
Fly.io — The Idle Cost Killer
Fly.io's suspend/resume is the killer feature for agents that are bursty. It saves complete memory state to disk and resumes faster than a cold boot. A suspended machine costs only storage (~$0.15/GB/month) — meaning an agent that's active 5% of the time costs $0.15-0.50/month instead of $6-32/month always-on.
The Machines API gives you programmatic lifecycle control: POST /machines/{id}/suspend and POST /machines/{id}/start. Auto-stop/auto-start triggers on incoming requests, so your agent wakes automatically when needed.
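Those two endpoints are easy to wrap. A minimal client sketch — the `api.machines.dev` base URL is an assumption about the API host; the paths are the ones quoted above. The session is injected so it can be a `requests.Session` in production or a stub in tests:

```python
class FlyMachines:
    """Tiny wrapper for the two lifecycle calls mentioned above.

    BASE is an assumption about the API host. `session` is any object
    with a .post(url, headers=...) method.
    """
    BASE = "https://api.machines.dev/v1"

    def __init__(self, session, app_name: str, token: str):
        self.session = session
        self.app = app_name
        self.headers = {"Authorization": f"Bearer {token}"}

    def _post(self, machine_id: str, action: str):
        url = f"{self.BASE}/apps/{self.app}/machines/{machine_id}/{action}"
        return self.session.post(url, headers=self.headers)

    def suspend(self, machine_id: str):
        # Snapshot memory to disk; billing drops to storage only.
        return self._post(machine_id, "suspend")

    def start(self, machine_id: str):
        # Resume from the memory snapshot.
        return self._post(machine_id, "start")
```

With auto-stop/auto-start enabled you rarely call these yourself; the explicit calls are useful when the agent knows it has finished a batch of work.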
Note: Fly.io GPUs are being deprecated after August 2026. CPU-only going forward.
Hetzner — Unbeatable for Always-On
CX23 (2 vCPU, 4GB, EU-only) at $4.35/month with 20TB transfer included. No GPU, no auto-scaling, no suspend/resume — but for a simple always-on agent that doesn't need idle management, nothing comes close on price.
Railway and Render — Deploy and Forget
Both offer GitHub-connected auto-deploys with minimal configuration. Railway charges $20/vCPU/mo + $10/GB/mo (usage-based). Render has background workers from $7/mo with auto-scaling on paid plans. Neither supports GPU or has meaningful idle cost savings. Choose these when simplicity matters more than optimization.
GPU Cloud Pricing Compared
GPU pricing varies enormously. For sustained inference, Lambda Cloud is cheapest. For bursty workloads, serverless options (Modal, Cloud Run) with scale-to-zero are more cost-effective.
A100 80GB On-Demand (Per Hour)
| Provider | On-Demand | Spot | Reserved |
|---|---|---|---|
| Spheron | $1.07 | $0.60 | — |
| Lambda Cloud | $1.29 | — | ~$0.90 |
| RunPod | $2.17 | — | — |
| CoreWeave | $2.21 | — | ~$1.00 |
| DigitalOcean | $2.99 | — | $1.99 |
| AWS (p4d) | ~$3.43 | ~$3.07 | ~$2.00 |
| GCP | ~$3.67 | ~$1.10 | ~$2.10 |
| Azure | ~$5.78+ | — | — |
H100 SXM On-Demand (Per Hour)
| Provider | On-Demand | Spot | Reserved |
|---|---|---|---|
| Lambda Cloud | $2.49 | — | ~$1.89 |
| Spheron | $2.50 | $1.03 | — |
| RunPod | $2.69 | — | — |
| DigitalOcean | $2.99 | — | $1.99 |
| CoreWeave | $4.76+ | — | ~$2.00 |
| AWS (p5) | ~$6.88 | ~$3.83 | — |
| GCP | ~$8.70 | ~$2.61 | — |
Lambda Cloud at $1.29/hr for an A100 80GB is roughly 3x cheaper than AWS on-demand. For sustained 24/7 inference, that's $929/month vs ~$2,470/month. Self-hosting can break even in as little as one to three months:
| GPU | VRAM | Price | Break-Even vs Cloud | Best For |
|---|---|---|---|---|
| RTX 5090 | 32 GB GDDR7 | ~$1,999 | 2-3 months | 32B+ models, long context |
| RTX 4090 | 24 GB GDDR6X | ~$1,000 (used) | 1-2 months | Best value used |
| RTX 5080 | 16 GB GDDR7 | ~$999 | 1-2 months | Sub-27B models |
| RTX 3090 | 24 GB GDDR6X | ~$700 (used) | Less than 1 month | Budget 24GB option |
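The break-even figures above come from dividing the card price by the sustained cloud rate. A sketch of that calculation, deliberately ignoring power, cooling, and resale value:

```python
def breakeven_months(card_price: float, cloud_rate_per_hour: float,
                     hours_per_month: float = 730) -> float:
    """Months of sustained cloud rental that equal the card's purchase price.

    Ignores power, cooling, and resale value -- a deliberate simplification,
    so treat the result as a lower bound on the true break-even point.
    """
    return card_price / (cloud_rate_per_hour * hours_per_month)

# RTX 4090 (~$1,000 used) vs renting an A100 at $1.29/hr sustained:
months = breakeven_months(1_000, 1.29)   # ~1.1 months
```

The comparison only holds if your workload actually fits in the card's VRAM and runs near 24/7; bursty workloads favor serverless GPUs no matter how cheap the hardware is.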
Workflow Orchestration and Durability
For agents that run multi-step workflows spanning minutes to days, you need durable execution — automatic checkpointing, retry logic, and crash recovery.
| Platform | Durability | Multi-Agent | LLM Native | TypeScript | Self-Host | Best For |
|---|---|---|---|---|---|---|
| Temporal | Excellent | Good | Good | Yes | Yes (OSS) | Enterprise agents |
| Inngest | Good | Good | Excellent | Excellent | Yes | Serverless/event-driven |
| Trigger.dev | Good | Fair | Good | Excellent | Yes | Long-running TS tasks |
| LangGraph Platform | Good | Excellent | Excellent | Fair | Yes | LangChain teams |
| AWS Step Functions | Good | Fair | Fair | Fair | No | AWS-native teams |
| Lambda Durable (2026) | Excellent | Fair | Good | Yes | No | AWS serverless agents |
| BullMQ | Basic | None | None | Excellent | Yes | Simple job queues |
| CrewAI | Basic | Excellent | Excellent | No | Yes | Multi-agent prototypes |
Temporal — The Gold Standard
Temporal is used by Netflix, Stripe, and Snap for mission-critical workflows. For AI agents, every LLM call becomes a Temporal Activity with automatic retry and state persistence. The OpenAI Agents SDK integration (GA March 2026) makes this seamless.
The key advantage: on failure recovery, Temporal replays workflow code against event history without re-executing LLM calls — saving token costs. The downside is a steep learning curve (weeks to learn properly) and significant operational overhead if self-hosted ($26K-41K/month TCO with SREs). Temporal Cloud ($200-2,000/month) eliminates that burden.
Critical gotcha: The default retry policy causes retry storms on 429 rate limits. You must classify LLM API errors explicitly and use exponential backoff with jitter.
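The fix has two halves: classify errors so only transient ones are retried, and use full-jitter backoff so concurrent workflows don't retry in lockstep. A framework-agnostic sketch — the status-code sets are typical choices, not an exhaustive classification:

```python
import random

RETRYABLE = {429, 500, 502, 503}      # transient: retry with backoff
NON_RETRYABLE = {400, 401, 403, 404}  # permanent: fail fast, retrying can't help

def is_retryable(status_code: int) -> bool:
    return status_code in RETRYABLE

def backoff_with_jitter(attempt: int, base: float = 1.0, cap: float = 60.0,
                        rng=random.random) -> float:
    """Full-jitter exponential backoff: sleep a random duration in
    [0, min(cap, base * 2**attempt)]. The randomness spreads retries out
    and avoids the synchronized retry storms described above."""
    return rng() * min(cap, base * (2 ** attempt))
```

In Temporal terms, the equivalent is listing permanent LLM API errors as non-retryable error types on the Activity's retry policy rather than letting the default policy hammer a rate-limited endpoint.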
Inngest — Best for Serverless TypeScript
Inngest is the natural fit for Vercel/Next.js teams. Its AgentKit SDK provides multi-agent orchestration with shared Network State, and step.ai.wrap() turns any OpenAI/Anthropic call into a durable step with automatic checkpointing. The useAgent React hook streams real-time updates from durable workflows directly to the browser.
Self-hostable as a single service with Postgres or SQLite backend.
Trigger.dev — Best for Long-Running TypeScript
Where Inngest uses serverless endpoints, Trigger.dev runs on dedicated compute — better for heavy AI workloads. No timeout limits, so tasks can run for hours. The Realtime bridge streams LLM responses to your frontend during background execution. First-class Next.js integration and Apache 2.0 licensed.
AWS Lambda Durable Functions (New in 2026)
The biggest AWS development for agents: imperative durable execution directly inside Lambda handlers with transparent checkpointing. Maps "almost perfectly" onto AI agent orchestration patterns — iterative tool calls, conversation context across turns, human-in-the-loop. No new infrastructure needed, but full vendor lock-in.
The Consensus Pattern
The durable execution pattern for multi-step agents in 2026:
1. Each LLM call = Activity/Step (automatically checkpointed)
2. Each tool execution = Activity/Step (automatically retried)
3. Conversation state = Workflow state (persisted externally)
4. Recovery = Replay from last checkpoint (not from scratch)
This saves token costs on failure recovery and provides complete audit trails.
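Stripped of any particular engine, the pattern reduces to memoizing step results in an external store. A minimal sketch — the in-memory dict stands in for the workflow engine's durable event history:

```python
CHECKPOINTS = {}  # stand-in for the engine's durable store (event history, DB)

def durable_step(step_id: str, fn, *args):
    """Run fn once, persist the result, and serve the stored result on replay.

    This is the core of the pattern: recovery replays the workflow code,
    but completed steps (LLM calls, tool executions) return their
    checkpointed results instead of being re-executed -- so a crash after
    step 3 never re-pays for steps 1-3's tokens.
    """
    if step_id in CHECKPOINTS:
        return CHECKPOINTS[step_id]
    result = fn(*args)
    CHECKPOINTS[step_id] = result
    return result
```

Real engines add the hard parts this sketch omits: deterministic replay guarantees, per-step retry policies, and versioning of workflow code against old histories.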
How Top AI Companies Deploy Agents
The most useful signal comes from looking at how companies running agents at scale actually architect their systems.
| Company | Infrastructure | Sandboxing | Pricing |
|---|---|---|---|
| Devin | AWS (multi-tenant or dedicated VPC) | Isolated "Devbox" per task | ~$8-9/hr active work |
| OpenAI Codex | OpenAI cloud containers | Fresh container, offline by default | 25-50 CU/task |
| Cursor | Cloud workers, outbound HTTPS | Isolated VMs, self-hosted option | $20/mo Pro |
| GitHub Copilot | GitHub Actions runners | Ephemeral Actions environment | $10-39/mo |
| Anthropic Claude Code | Local CLI + Managed Agents | Brain/Hands/Session separation | API tokens + $0.08/session-hr |
| Replit Agent | GCP (single-tenant enterprise) | Snapshot-based deployment | $25-100/mo |
The Brain-Hands-Session Pattern (Anthropic)
The most sophisticated architecture. Anthropic separates the LLM reasoning ("brain") from code execution ("hands") and state ("session"). Each component can fail independently and be replaced without losing progress. The session is a durable event log — on crash, the agent resumes from the last recorded event rather than starting over.
This pattern is worth studying even if you're not using Claude. The key insight: don't couple your agent's reasoning with its execution environment. If the sandbox crashes, the brain should be able to reconnect to a new sandbox and continue.
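The session leg of the pattern is just an append-only event log. A minimal sketch of the idea — the class and event shapes are illustrative, not Anthropic's actual implementation:

```python
class Session:
    """Durable event log: the 'session' leg of brain/hands/session.

    Append every observation (tool call, LLM response, user input). On a
    sandbox crash, a reconnected brain rebuilds its state by replaying the
    log into a fresh sandbox instead of starting the task over.
    """
    def __init__(self, events=None):
        self.events = list(events or [])

    def record(self, event: dict) -> None:
        self.events.append(event)

    def resume_point(self):
        """Last recorded event, i.e. where a reconnected brain picks up."""
        return self.events[-1] if self.events else None
```

In production the log lives in durable storage, not process memory, so it survives the crash of any single component.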
The Harness Pattern (Cursor)
Cursor's architecture uses a control loop ("harness") that manages AI inference and dispatches tool calls to workers. Workers use outbound-only HTTPS to connect to Cursor's cloud — no inbound ports, VPNs, or firewall changes needed. This makes self-hosted deployment dramatically simpler.
Their self-hosted option (March 2026) deploys via Helm chart + Kubernetes operator with a fleet management API for autoscaling. 35% of Cursor's own merged PRs are written by their autonomous cloud agents.
The Ephemeral Sandbox Pattern (Codex, E2B)
OpenAI Codex runs a fresh container per task with a two-phase runtime: setup phase (network enabled, install dependencies) and agent phase (network disabled by default unless explicitly enabled). This offline-by-default approach is the strongest security posture — the agent literally cannot exfiltrate data.
E2B provides the sandbox infrastructure used by many of these companies. Firecracker microVMs boot in under 200ms, and E2B scaled from 40K to 15M sandboxes/month in a single year. 88% of Fortune 100 have signed up.
Sandboxing and Security
If your agent executes untrusted code, the isolation method matters enormously.
| Method | Startup | Security Boundary | Use Case |
|---|---|---|---|
| Docker containers | ~500ms | Shared kernel (weak) | Trusted code only |
| gVisor | ~200ms | User-space kernel intercept | Medium trust |
| Firecracker microVM | ~125ms (28ms snapshots) | Hardware virtualization | Untrusted code |
| Kata Containers | ~300ms | Full VM boundary | Enterprise/compliance |
Docker containers share the host kernel, so a single kernel vulnerability can allow container escape. MicroVM escape requires a hypervisor CVE — so rare that such exploits command $250K-$500K bounties. For production agents executing arbitrary code, Firecracker (E2B) or gVisor (Modal) is the minimum acceptable isolation.
The industry has converged: E2B uses Firecracker, Modal uses gVisor, OpenAI Codex uses isolated containers with network disabled, Cursor uses isolated VMs. Docker alone is only acceptable for trusted, internally-developed agent code.
Agent Protocol Standards (2026)
Three protocols are competing to standardize agent communication:
- MCP (Model Context Protocol) — Anthropic, now under Linux Foundation. AI-to-tools integration
- A2A (Agent2Agent) — Google Cloud. Peer-to-peer agent communication
- ACP (Agent Communication Protocol) — IBM BeeAI. Simple HTTP agent messaging
MCP has the most adoption. A2A is gaining traction for multi-agent systems.
Cost Optimization Strategies
LLM API calls are 50-70% of a typical agent system's costs. Compute is 15-30%. Optimize LLM spend first.
Strategy 1: Suspend/Resume (Fly.io)
Agents that are active 5% of the time cost $0.15-0.50/month suspended vs $6-32/month always-on. The agent suspends on idle, resumes in seconds on incoming requests.
Strategy 2: Spot + Checkpointing
AWS/GCP Spot instances give 60-90% GPU savings. Checkpoint state every 5-10 minutes to S3/GCS. Auto-restart on interruption. Spotify saved 70% ($8.2M to $2.4M/year) with this pattern.
Strategy 3: Tiered Compute
Split the agent brain (cheap CPU, Hetzner ~$4/month) from GPU execution (serverless Modal/RunPod, on-demand only). The control plane is always on. GPU scales to zero when not processing.
Strategy 4: Model Routing
Frontier models for complex reasoning, mid-tier for standard tasks. One team reduced costs from $187/month to $78/month — a 58% reduction — just by routing simple queries to cheaper models.
Strategy 5: Reserved + On-Demand Mix
Reserve baseline GPU capacity (30-50% discount), burst with on-demand or spot for peaks. Lambda Cloud 1-year reserved H100: $1.89/hr vs $2.49/hr on-demand.
Decision Framework
Architecture Decision Tree
Does the agent need to run untrusted code?
Yes → E2B (Firecracker) or Modal (gVisor)
No →
Does it need GPU?
Yes → Sustained? → Lambda Cloud / CoreWeave
Bursty? → Modal / Cloud Run / RunPod
No →
How long does it run?
Less than 15 min → AWS Lambda or Cloud Run
15 min - 6 hr → GitHub Actions or Modal
Hours/days → Temporal + Fly.io/Railway
Always on → Hetzner ($4) or Fly.io ($2-7)
Mostly idle? → Fly.io suspend/resume ($0.15/mo)
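The tree above is small enough to encode directly, which is handy if you want to wire it into an internal provisioning tool. A direct transcription — platform names and thresholds come straight from the tree:

```python
def pick_platform(untrusted_code: bool, needs_gpu: bool, gpu_sustained: bool,
                  runtime_minutes: float, always_on: bool,
                  mostly_idle: bool) -> str:
    """Direct encoding of the decision tree above (6 hr = 360 min)."""
    if untrusted_code:
        return "E2B (Firecracker) or Modal (gVisor)"
    if needs_gpu:
        return ("Lambda Cloud / CoreWeave" if gpu_sustained
                else "Modal / Cloud Run / RunPod")
    if always_on:
        return "Fly.io suspend/resume" if mostly_idle else "Hetzner or Fly.io"
    if runtime_minutes < 15:
        return "AWS Lambda or Cloud Run"
    if runtime_minutes <= 360:
        return "GitHub Actions or Modal"
    return "Temporal + Fly.io/Railway"
```

Note the ordering matters: the security question comes first because untrusted code overrides every cost consideration.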
By Agent Type
| Agent Type | Recommended Platform | Monthly Cost |
|---|---|---|
| Chat/QA bot (always available) | Fly.io suspend/resume | $2-7 |
| Coding agent (triggered by issues) | GitHub Actions + Blacksmith | $40-400 |
| Research agent (multi-step, hours) | Temporal + Fly.io | $25-200 |
| Data processing (batch) | Modal or Cloud Run | $50-500 |
| Image/video agent (GPU) | Modal (bursty) or Lambda Cloud (sustained) | $50-929 |
| Enterprise multi-agent | AWS ECS + Temporal + Modal | $200-5,000 |
By Team Profile
| Profile | Stack |
|---|---|
| Solo dev, TypeScript | Inngest or Trigger.dev + Fly.io |
| Startup, Python | Modal + Temporal Cloud |
| Enterprise, AWS | ECS/Fargate + Lambda Durable Functions |
| Enterprise, Azure | Container Apps + Agent Framework |
| Enterprise, GCP | Cloud Run + Vertex AI |
| Open-source preference | Temporal (self-hosted) + E2B/Daytona + CrewAI |
Quick Reference
Platform Cost at a Glance
| Platform | Minimum Cost | Best For | GPU |
|---|---|---|---|
| Fly.io (suspended) | $0.15/mo | Idle agents | No |
| Hetzner CX23 | $4.35/mo | Always-on CPU | No |
| Cloudflare Workers | $5/mo | Edge routing | Workers AI |
| Render | $7/mo | Simple background workers | No |
| Railway | $5/mo (+usage) | Auto-deploy from GitHub | No |
| AWS Lambda | Free tier | Short API tasks | No |
| GitHub Actions | Free tier | Coding agents | T4 |
| Modal | $30 free credits | GPU sandboxes | H100/A100/T4/L4 |
| Cloud Run | Free tier | Production containers | L4/RTX PRO 6000 |
| Lambda Cloud | $1.29/hr (A100) | Sustained GPU | A100/H100/B200 |
| Temporal Cloud | $200/mo | Durable workflows | N/A |
| E2B | $0.05/hr | Untrusted code sandboxes | No |
Orchestration at a Glance
| Platform | Durability | Best For | Pricing |
|---|---|---|---|
| Temporal | Enterprise-grade | Complex long-running agents | $200-2,000/mo cloud |
| Inngest | Good | Serverless TypeScript | Free tier + usage |
| Trigger.dev | Good | Long-running TS tasks | Free tier + usage |
| LangGraph Platform | Good | LangChain multi-agent | $0.001/node |
| AWS Step Functions | Good | AWS-native workflows | $0.025/1K transitions |
| BullMQ | Basic | Simple Redis queues | Free (OSS) |
Conclusion
The agent deployment landscape has matured into clear categories with clear winners. For ephemeral sandboxes, Modal and E2B dominate. For durable workflows, Temporal is the gold standard with Inngest and Trigger.dev as lighter alternatives for TypeScript teams. For persistent agents, Fly.io's suspend/resume pattern is the most cost-effective approach for bursty workloads, while Hetzner is unbeatable for truly always-on CPU agents.
The most important architectural insight from studying how Devin, Codex, and Cursor deploy: separate your agent's brain, hands, and state. The reasoning engine, the execution environment, and the session state should each be independently replaceable. When a sandbox crashes, the agent should reconnect to a new one and continue from its last checkpoint — not start over. This pattern, pioneered by Anthropic's managed agents, is becoming the industry standard for production agent systems.
Some links in this article are affiliate links. If you purchase through them, I may earn a small commission at no extra cost to you. I only recommend products I actually use.
