
AI Agent Deployment: Cloud Platforms Compared for Ephemeral, Long-Running, and GPU Workloads (2026)

The AI agent deployment landscape has fractured into three distinct patterns in 2026, each with wildly different cost, latency, and durability tradeoffs. A Fly.io agent sleeping between requests costs $0.15/month. A Modal sandbox with an H100 GPU costs $3.95/hour. An always-on Hetzner VM costs $4/month. Choosing wrong means either burning money or building on infrastructure that can't handle your agent's actual workload.

This guide breaks down 20+ platforms across ephemeral, long-running, and GPU-accelerated deployments — with real pricing, architecture patterns, and how the top AI companies (Devin, Codex, Cursor, Anthropic) actually deploy their agents in production. For context on how AI agents use retrieval and reasoning, see my RAG techniques comparison guide.

> How LLMs Work: Premium Report
Get the 24-page PDF with 5 exclusive sections — Model Comparison Matrix, Parameter Calculation Worksheets, ML Interview Cheat Sheet, Annotated Paper Reading List, and 50+ term Glossary.
[Get the Premium Report — $19]


The Three Deployment Patterns

Every AI agent deployment in 2026 falls into one of three patterns:

| Pattern | How It Works | Latency | Cost Profile | Best For |
|---|---|---|---|---|
| Ephemeral Sandbox | Spin up, execute task, destroy | Cold start + task time | Pay-per-second | Coding agents, untrusted code, one-shot tasks |
| Durable Workflow | Long-running with checkpointing | Seconds to hours | Per-step or per-hour | Multi-step agents, human-in-the-loop |
| Persistent Service | Always on or suspend/resume | Instant (warm) | Monthly or per-second | Chat agents, monitoring, always-available |

The right pattern depends on your agent's lifecycle. A coding agent that processes GitHub issues needs an ephemeral sandbox — spin up a container, clone the repo, execute, output a PR, destroy. A research agent that runs for hours across multiple data sources needs durable workflows with checkpointing. A customer support bot needs to be always available but is idle 95% of the time — suspend/resume is the sweet spot.

Ephemeral and Serverless Platforms

These platforms spin up compute on demand. You pay only for what you use, but cold starts and timeout limits matter.

| Platform | Max Timeout | GPU | GPU Cost/hr | Cold Start | Best For |
|---|---|---|---|---|---|
| Modal | 24 hours | H100, A100, T4, L4 | $0.59-3.95 | <1 sec | AI sandboxes, GPU inference |
| Cloud Run | 60 min | L4, RTX PRO 6000 | ~$0.67 (L4) | <5 sec (GPU) | Production containerized agents |
| GitHub Actions | 6 hours | T4 (16GB) | $3.12-6.12 | 15-45 sec | Coding agents, repo automation |
| AWS Lambda | 15 min | None | N/A | 100ms-10s | Short API orchestration |
| Cloudflare Workers | 5 min CPU | Workers AI | ~$0.011/1K neurons | ~1ms | Edge routing, lightweight agents |
| Vercel Functions | 5 min (Pro) | None | N/A | 250ms-1s | Frontend API layer only |
| Blacksmith | 6 hours | None | N/A | Faster than GH | Cheaper GitHub Actions |
| Azure Container Apps | Configurable | A100, T4 | ~$2.88-3.96 | Variable | Enterprise Azure workloads |

Modal dominates AI agent sandboxed execution in 2026 ($87M Series B, $1.1B valuation). Sandboxes run on gVisor, autoscale to 50,000+ concurrent sessions, and start in under 1 second — with GPU access inside sandboxes.

The tradeoffs: the SDK is Python-only (TypeScript agents need a wrapper), and sandboxes bill at non-preemptible rates, roughly 3.75x the advertised base rates. At 10K CPU-only runs/month (5-min average), expect ~$50-100. With T4 GPUs, that jumps to ~$500.
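
Those estimates are easy to sanity-check with a per-second cost model. A sketch; the CPU rate below is a hypothetical effective rate chosen for illustration, while the GPU figure uses the $0.59/hr T4 price listed above:

```python
def monthly_sandbox_cost(runs_per_month: int, avg_minutes: float,
                         rate_per_second: float) -> float:
    """Estimated monthly spend for pay-per-second ephemeral sandboxes."""
    billable_seconds = runs_per_month * avg_minutes * 60
    return billable_seconds * rate_per_second

# Hypothetical effective CPU rate of $0.000025/sec (~$0.09/hr):
cpu = monthly_sandbox_cost(10_000, 5, 0.000025)      # $75/month
# T4 at the listed $0.59/hr:
gpu = monthly_sandbox_cost(10_000, 5, 0.59 / 3600)   # ≈ $492/month
```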

GitHub Actions — The Default Coding Agent Runtime

AI agent PRs on GitHub surged from 4M (Sep 2025) to 17M+ (Mar 2026). GitHub's new Agentic Workflows let you define agent workflows in plain Markdown with native repo access via the GitHub MCP Server. The 6-hour timeout per job is generous for most coding tasks.

Blacksmith is a drop-in replacement with 2x faster hardware (gaming-grade CPUs) at 50-67% lower cost and Docker layer caching. For coding agents with heavy dependency installs, Blacksmith cuts environment setup from minutes to seconds.

AWS Lambda — Limited but Cheap

The 15-minute hard timeout and zero GPU support make Lambda unsuitable for complex agents. It works for short API orchestration — an agent that receives a webhook, makes a few LLM API calls, and stores the result. SnapStart (2026) brings cold starts down to 90-140ms for Python. ARM64/Graviton2 cuts costs 20%.
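
That short-orchestration shape is worth making concrete. A minimal sketch with the LLM client and result store injected as plain callables, so the handler stays testable; all names here are illustrative, not a specific SDK:

```python
import json
from typing import Callable

def make_handler(call_llm: Callable[[str], str],
                 store: Callable[[str, str], None]):
    """Build a Lambda-style handler: webhook in, one LLM call, result stored."""
    def handler(event, context=None):
        payload = json.loads(event["body"])           # webhook body
        summary = call_llm(f"Summarize: {payload['text']}")
        store(payload["id"], summary)                 # e.g. DynamoDB/S3 in prod
        return {"statusCode": 200, "body": json.dumps({"id": payload["id"]})}
    return handler
```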

Cloudflare Workers — Cheapest at Scale

At $5/month for 10M requests, Workers is unbeatable for lightweight agent routing. The 128MB memory limit is severely restrictive for AI workloads, but Durable Objects provide uniquely powerful stateful edge compute for agent state. Best used as the routing and orchestration layer that dispatches heavy work elsewhere.

Long-Running and Persistent Platforms

For agents that need to stay alive — either always-on or waking on demand.

CPU-Only Persistent Agent Comparison (2 vCPU, 4GB RAM, 24/7)

| Platform | Monthly Cost | Idle Cost | Auto-Stop | Complexity |
|---|---|---|---|---|
| Hetzner CX23 | $4.35 | N/A (always on) | Manual | Low |
| Fly.io | ~$6.64 | ~$0.15 (suspended) | Suspend/Resume | Low |
| Render Starter | $7-25 | N/A (always on) | None | Low |
| DigitalOcean | $12 | Billed (powered off) | Manual | Low |
| Railway | ~$20 | Sleep available | Dev only | Low |
| GCP e2-medium | ~$25 | $0 (suspended) | Via Scheduler | Medium |
| Azure ACI | ~$32 | $0 (stopped) | Via Functions | Medium |
| AWS Fargate | ~$36 | $0 (stopped) | Via Lambda | High |

Fly.io — The Idle Cost Killer

Fly.io's suspend/resume is the killer feature for agents that are bursty. It saves complete memory state to disk and resumes faster than a cold boot. A suspended machine costs only storage (~$0.15/GB/month) — meaning an agent that's active 5% of the time costs $0.15-0.50/month instead of $6-32/month always-on.

The Machines API gives you programmatic lifecycle control: POST /machines/{id}/suspend and POST /machines/{id}/start. Auto-stop/auto-start triggers on incoming requests, so your agent wakes automatically when needed.
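
In Python, that lifecycle control is a couple of authenticated POSTs. A sketch assuming the public Machines API base URL; the endpoint paths follow the pattern quoted above:

```python
import urllib.request

FLY_API = "https://api.machines.dev/v1"   # assumption: current Machines API base

def machine_url(app: str, machine_id: str, action: str) -> str:
    return f"{FLY_API}/apps/{app}/machines/{machine_id}/{action}"

def machine_action(app: str, machine_id: str, action: str, token: str) -> int:
    """POST a lifecycle action ('suspend', 'start', 'stop') to a Fly Machine."""
    req = urllib.request.Request(
        machine_url(app, machine_id, action),
        method="POST",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

Suspending on idle and letting auto-start wake the machine on traffic is what turns an always-on bill into a storage-only bill.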

Note: Fly.io GPUs are being deprecated after August 2026. CPU-only going forward.

Hetzner — Unbeatable for Always-On

CX23 (2 vCPU, 4GB, EU-only) at $4.35/month with 20TB transfer included. No GPU, no auto-scaling, no suspend/resume — but for a simple always-on agent that doesn't need idle management, nothing comes close on price.

Railway and Render — Deploy and Forget

Both offer GitHub-connected auto-deploys with minimal configuration. Railway charges $20/vCPU/mo + $10/GB/mo (usage-based). Render has background workers from $7/mo with auto-scaling on paid plans. Neither supports GPU or has meaningful idle cost savings. Choose these when simplicity matters more than optimization.

> Railway
Deploy in Minutes
Railway deploys from GitHub with zero configuration — databases, cron jobs, and autoscaling included. No Dockerfiles required.
[Try Railway Free]

GPU Cloud Pricing Compared

GPU pricing varies enormously. For sustained inference, Lambda Cloud is cheapest. For bursty workloads, serverless options (Modal, Cloud Run) with scale-to-zero are more cost-effective.

A100 80GB On-Demand (Per Hour)

| Provider | On-Demand | Spot | Reserved |
|---|---|---|---|
| Spheron | $1.07 | $0.60 | — |
| Lambda Cloud | $1.29 | — | ~$0.90 |
| RunPod | $2.17 | — | — |
| CoreWeave | $2.21 | — | ~$1.00 |
| DigitalOcean | $2.99 | $1.99 | — |
| AWS (p4d) | ~$3.43 | ~$3.07 | ~$2.00 |
| GCP | ~$3.67 | ~$1.10 | ~$2.10 |
| Azure | ~$5.78+ | — | — |

H100 SXM On-Demand (Per Hour)

| Provider | On-Demand | Spot | Reserved |
|---|---|---|---|
| Lambda Cloud | $2.49 | — | ~$1.89 |
| Spheron | $2.50 | $1.03 | — |
| RunPod | $2.69 | — | — |
| DigitalOcean | $2.99 | $1.99 | — |
| CoreWeave | $4.76+ | — | ~$2.00 |
| AWS (p5) | ~$6.88 | ~$3.83 | — |
| GCP | ~$8.70 | ~$2.61 | — |

Lambda Cloud at $1.29/hr for an A100 80GB is roughly 2.7x cheaper than AWS on-demand. For sustained 24/7 inference, that's $929/month vs ~$2,470/month. Against those cloud rates, self-hosting a consumer GPU can break even within a few months:

| GPU | VRAM | Price | Break-Even vs Cloud | Best For |
|---|---|---|---|---|
| RTX 5090 | 32 GB GDDR7 | ~$1,999 | 2-3 months | 32B+ models, long context |
| RTX 4090 | 24 GB GDDR6X | ~$1,000 (used) | 1-2 months | Best value used |
| RTX 5080 | 16 GB GDDR7 | ~$999 | 1-2 months | Sub-27B models |
| RTX 3090 | 24 GB GDDR6X | ~$700 (used) | <1 month | Budget 24GB option |
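
The break-even math behind that table is one division. A sketch that ignores electricity, resale value, and the performance gap between consumer cards and data-center GPUs:

```python
def break_even_months(gpu_price: float, cloud_rate_per_hour: float,
                      hours_per_month: float = 730) -> float:
    """Months of 24/7 cloud rental equal to the one-time GPU purchase price."""
    return gpu_price / (cloud_rate_per_hour * hours_per_month)

# RTX 5090 (~$1,999) vs Lambda Cloud A100 at $1.29/hr, running 24/7:
months = break_even_months(1999, 1.29)   # ≈ 2.1 months
```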

Workflow Orchestration and Durability

For agents that run multi-step workflows spanning minutes to days, you need durable execution — automatic checkpointing, retry logic, and crash recovery.

| Platform | Durability | Multi-Agent | LLM Native | TypeScript | Self-Host | Best For |
|---|---|---|---|---|---|---|
| Temporal | Excellent | Good | Good | Yes | Yes (OSS) | Enterprise agents |
| Inngest | Good | Good | Excellent | Excellent | Yes | Serverless/event-driven |
| Trigger.dev | Good | Fair | Good | Excellent | Yes | Long-running TS tasks |
| LangGraph Platform | Good | Excellent | Excellent | Fair | Yes | LangChain teams |
| AWS Step Functions | Good | Fair | Fair | Fair | No | AWS-native teams |
| Lambda Durable (2026) | Excellent | Fair | Good | Yes | No | AWS serverless agents |
| BullMQ | Basic | None | None | Excellent | Yes | Simple job queues |
| CrewAI | Basic | Excellent | Excellent | No | Yes | Multi-agent prototypes |

Temporal — The Gold Standard

Temporal is used by Netflix, Stripe, and Snap for mission-critical workflows. For AI agents, every LLM call becomes a Temporal Activity with automatic retry and state persistence. The OpenAI Agents SDK integration (GA March 2026) makes this seamless.

The key advantage: on failure recovery, Temporal replays workflow code against event history without re-executing LLM calls — saving token costs. The downside is a steep learning curve (weeks to learn properly) and significant operational overhead if self-hosted ($26K-41K/month TCO with SREs). Temporal Cloud ($200-2,000/month) eliminates that burden.

Critical gotcha: The default retry policy causes retry storms on 429 rate limits. You must classify LLM API errors explicitly and use exponential backoff with jitter.
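
A hedged sketch of that fix; in Temporal itself you would express the limits through the activity's retry policy, but the error triage and full-jitter backoff look like this regardless of orchestrator:

```python
import random

RETRYABLE_STATUS = {429, 500, 502, 503}   # retry rate limits and server errors,
                                          # never 400/401/404 (they won't heal)

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng=random.random) -> float:
    """Full-jitter exponential backoff: uniform in [0, min(cap, base * 2^attempt)]."""
    return rng() * min(cap, base * (2 ** attempt))

def should_retry(status: int, attempt: int, max_attempts: int = 6) -> bool:
    return status in RETRYABLE_STATUS and attempt < max_attempts
```

The jitter is what prevents a fleet of workers from retrying in lockstep and re-triggering the same 429.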

Inngest — Best for Serverless TypeScript

Inngest is the natural fit for Vercel/Next.js teams. Its AgentKit SDK provides multi-agent orchestration with shared Network State, and step.ai.wrap() turns any OpenAI/Anthropic call into a durable step with automatic checkpointing. The useAgent React hook streams real-time updates from durable workflows directly to the browser.

Self-hostable as a single service with Postgres or SQLite backend.

Trigger.dev — Best for Long-Running TypeScript

Where Inngest uses serverless endpoints, Trigger.dev runs on dedicated compute — better for heavy AI workloads. No timeout limits, so tasks can run for hours. The Realtime bridge streams LLM responses to your frontend during background execution. First-class Next.js integration and Apache 2.0 licensed.

AWS Lambda Durable Functions (New in 2026)

The biggest AWS development for agents: imperative durable execution directly inside Lambda handlers with transparent checkpointing. Maps "almost perfectly" onto AI agent orchestration patterns — iterative tool calls, conversation context across turns, human-in-the-loop. No new infrastructure needed, but full vendor lock-in.

The Consensus Pattern

The durable execution pattern for multi-step agents in 2026:

1. Each LLM call = Activity/Step (automatically checkpointed)
2. Each tool execution = Activity/Step (automatically retried)
3. Conversation state = Workflow state (persisted externally)
4. Recovery = Replay from last checkpoint (not from scratch)

This saves token costs on failure recovery and provides complete audit trails.
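
The replay step of this pattern fits in a few lines. A toy sketch of the idea, not any specific SDK's API:

```python
class DurableRun:
    """Toy durable execution: each step's result is checkpointed to an event
    log; on replay, completed steps return recorded results without re-running."""

    def __init__(self, event_log=None):
        self.event_log = event_log if event_log is not None else []
        self._cursor = 0

    def step(self, name, fn):
        if self._cursor < len(self.event_log):            # replaying history
            recorded_name, result = self.event_log[self._cursor]
            assert recorded_name == name, "workflow code must be deterministic"
            self._cursor += 1
            return result
        result = fn()                                     # first execution
        self.event_log.append((name, result))
        self._cursor += 1
        return result
```

Re-running a crashed workflow with the saved `event_log` skips the LLM calls you already paid for, which is exactly the token saving described above.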

How Top AI Companies Deploy Agents

The most useful signal comes from looking at how companies running agents at scale actually architect their systems.

| Company | Infrastructure | Sandboxing | Pricing |
|---|---|---|---|
| Devin | AWS (multi-tenant or dedicated VPC) | Isolated "Devbox" per task | ~$8-9/hr active work |
| OpenAI Codex | OpenAI cloud containers | Fresh container, offline by default | 25-50 CU/task |
| Cursor | Cloud workers, outbound HTTPS | Isolated VMs, self-hosted option | $20/mo Pro |
| GitHub Copilot | GitHub Actions runners | Ephemeral Actions environment | $10-39/mo |
| Anthropic Claude Code | Local CLI + Managed Agents | Brain/Hands/Session separation | API tokens + $0.08/session-hr |
| Replit Agent | GCP (single-tenant enterprise) | Snapshot-based deployment | $25-100/mo |

The Brain-Hands-Session Pattern (Anthropic)

The most sophisticated architecture. Anthropic separates the LLM reasoning ("brain") from code execution ("hands") and state ("session"). Each component can fail independently and be replaced without losing progress. The session is a durable event log — on crash, the agent resumes from the last recorded event rather than starting over.

This pattern is worth studying even if you're not using Claude. The key insight: don't couple your agent's reasoning with its execution environment. If the sandbox crashes, the brain should be able to reconnect to a new sandbox and continue.
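
A minimal sketch of that decoupling; the sandbox class here is a stand-in for whatever executor you actually use (E2B, Modal, a local VM):

```python
class FakeSandbox:
    """Stand-in for the 'hands'; a real one would wrap E2B, Modal, etc."""
    def __init__(self, name):
        self.name = name
    def run(self, command):
        return f"{self.name} ran {command}"

class AgentBrain:
    """Reasoning loop that owns only the durable session log, never the sandbox."""
    def __init__(self, session_log):
        self.session = session_log   # durable event log ("session")
        self.sandbox = None          # replaceable executor ("hands")

    def attach(self, sandbox):
        self.sandbox = sandbox       # swap in a fresh sandbox after a crash

    def act(self, command):
        output = self.sandbox.run(command)
        self.session.append((command, output))
        return output
```

Because `AgentBrain` never stores sandbox state, replacing a dead sandbox with `attach()` loses nothing: the session log still holds every prior step.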

The Harness Pattern (Cursor)

Cursor's architecture uses a control loop ("harness") that manages AI inference and dispatches tool calls to workers. Workers use outbound-only HTTPS to connect to Cursor's cloud — no inbound ports, VPNs, or firewall changes needed. This makes self-hosted deployment dramatically simpler.

Their self-hosted option (March 2026) deploys via Helm chart + Kubernetes operator with a fleet management API for autoscaling. 35% of Cursor's own merged PRs are written by their autonomous cloud agents.

The Ephemeral Sandbox Pattern (Codex, E2B)

OpenAI Codex runs a fresh container per task with a two-phase runtime: setup phase (network enabled, install dependencies) and agent phase (network disabled by default unless explicitly enabled). This offline-by-default approach is the strongest security posture — the agent literally cannot exfiltrate data.

E2B provides the sandbox infrastructure used by many of these companies. Firecracker microVMs boot in under 200ms, and E2B scaled from 40K to 15M sandboxes/month in a single year. 88% of Fortune 100 have signed up.

Sandboxing and Security

If your agent executes untrusted code, the isolation method matters enormously.

| Method | Startup | Security Boundary | Use Case |
|---|---|---|---|
| Docker containers | ~500ms | Shared kernel (weak) | Trusted code only |
| gVisor | ~200ms | User-space kernel intercept | Medium trust |
| Firecracker microVM | ~125ms (28ms snapshots) | Hardware virtualization | Untrusted code |
| Kata Containers | ~300ms | Full VM boundary | Enterprise/compliance |

Docker containers share the host kernel, so a single kernel vulnerability allows container escape. MicroVM escape requires a hypervisor CVE, exploits rare enough to command $250K-$500K bounties. For production agents executing arbitrary code, Firecracker (E2B) or gVisor (Modal) is the minimum acceptable isolation.

The industry has converged: E2B uses Firecracker, Modal uses gVisor, OpenAI Codex uses isolated containers with network disabled, Cursor uses isolated VMs. Docker alone is only acceptable for trusted, internally-developed agent code.

Agent Protocol Standards (2026)

Three protocols are competing to standardize agent communication:

  • MCP (Model Context Protocol) — Anthropic, now under Linux Foundation. AI-to-tools integration
  • A2A (Agent2Agent) — Google Cloud. Peer-to-peer agent communication
  • ACP (Agent Communication Protocol) — IBM BeeAI. Simple HTTP agent messaging

MCP has the most adoption. A2A is gaining traction for multi-agent systems.

Cost Optimization Strategies

LLM API calls are 50-70% of a typical agent system's costs. Compute is 15-30%. Optimize LLM spend first.

Strategy 1: Suspend/Resume (Fly.io)

Agents that are active 5% of the time cost $0.15-0.50/month suspended vs $6-32/month always-on. The agent suspends on idle, resumes in seconds on incoming requests.

Strategy 2: Spot + Checkpointing

AWS/GCP Spot instances give 60-90% GPU savings. Checkpoint state every 5-10 minutes to S3/GCS. Auto-restart on interruption. Spotify saved 70% ($8.2M to $2.4M/year) with this pattern.
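
A sketch of the checkpointing half of this strategy; `storage` is any dict-like object, so in production you would swap in a thin S3/GCS wrapper (the boto3/GCS calls are omitted here):

```python
import pickle
import time

class Checkpointer:
    """Persist agent state at most every `interval` seconds, so a spot
    interruption loses at most that much work."""

    def __init__(self, storage, key, interval=300.0, clock=time.monotonic):
        self.storage = storage       # dict-like; swap in an S3/GCS wrapper
        self.key = key
        self.interval = interval
        self.clock = clock
        self._last = None

    def maybe_save(self, state) -> bool:
        now = self.clock()
        if self._last is None or now - self._last >= self.interval:
            self.storage[self.key] = pickle.dumps(state)
            self._last = now
            return True
        return False                  # too soon since last checkpoint

    def restore(self, default=None):
        blob = self.storage.get(self.key)
        return pickle.loads(blob) if blob is not None else default
```

On restart after an interruption, the agent calls `restore()` and resumes from the last checkpoint instead of from scratch.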

Strategy 3: Tiered Compute

Split the agent brain (cheap CPU, Hetzner ~$4/month) from GPU execution (serverless Modal/RunPod, on-demand only). The control plane is always on. GPU scales to zero when not processing.

Strategy 4: Model Routing

Frontier models for complex reasoning, mid-tier for standard tasks. One team reduced costs from $187/month to $78/month — a 58% reduction — just by routing simple queries to cheaper models.
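
Routing can start as a crude heuristic and still capture most of the savings. A sketch; the model names and complexity markers below are placeholders, not recommendations:

```python
def route_model(prompt: str, tools_needed: bool) -> str:
    """Naive complexity routing: short, tool-free prompts go to the cheap tier."""
    complex_markers = ("analyze", "plan", "refactor", "prove")
    if (tools_needed
            or len(prompt) > 2000
            or any(m in prompt.lower() for m in complex_markers)):
        return "frontier-model"
    return "mid-tier-model"
```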

Strategy 5: Reserved + On-Demand Mix

Reserve baseline GPU capacity (30-50% discount), burst with on-demand or spot for peaks. Lambda Cloud 1-year reserved H100: $1.89/hr vs $2.49/hr on-demand.

Decision Framework

Architecture Decision Tree

Does the agent need to run untrusted code?
  Yes → E2B (Firecracker) or Modal (gVisor)
  No → Does it need GPU?
    Yes, sustained → Lambda Cloud / CoreWeave
    Yes, bursty → Modal / Cloud Run / RunPod
    No → How long does it run?
      <15 min → AWS Lambda or Cloud Run
      15 min - 6 hr → GitHub Actions or Modal
      Hours/days → Temporal + Fly.io/Railway
      Always on → Hetzner ($4) or Fly.io ($2-7)
      Always on, mostly idle → Fly.io suspend/resume ($0.15/mo)
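
The tree above can be encoded directly, which makes the platform choice reviewable and testable in code. A sketch under the same categories:

```python
def recommend(untrusted_code: bool, needs_gpu: bool, sustained_gpu: bool = False,
              runtime: str = "short", mostly_idle: bool = False) -> str:
    """Encode the decision tree. `runtime` is one of: 'short' (<15 min),
    'medium' (15 min - 6 hr), 'long' (hours/days), 'always'."""
    if untrusted_code:
        return "E2B (Firecracker) or Modal (gVisor)"
    if needs_gpu:
        return "Lambda Cloud / CoreWeave" if sustained_gpu else "Modal / Cloud Run / RunPod"
    if runtime == "short":
        return "AWS Lambda or Cloud Run"
    if runtime == "medium":
        return "GitHub Actions or Modal"
    if runtime == "long":
        return "Temporal + Fly.io/Railway"
    return "Fly.io suspend/resume" if mostly_idle else "Hetzner or Fly.io"
```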

By Agent Type

| Agent Type | Recommended Platform | Monthly Cost |
|---|---|---|
| Chat/QA bot (always available) | Fly.io suspend/resume | $2-7 |
| Coding agent (triggered by issues) | GitHub Actions + Blacksmith | $40-400 |
| Research agent (multi-step, hours) | Temporal + Fly.io | $25-200 |
| Data processing (batch) | Modal or Cloud Run | $50-500 |
| Image/video agent (GPU) | Modal (bursty) or Lambda Cloud (sustained) | $50-929 |
| Enterprise multi-agent | AWS ECS + Temporal + Modal | $200-5,000 |

By Team Profile

| Profile | Stack |
|---|---|
| Solo dev, TypeScript | Inngest or Trigger.dev + Fly.io |
| Startup, Python | Modal + Temporal Cloud |
| Enterprise, AWS | ECS/Fargate + Lambda Durable Functions |
| Enterprise, Azure | Container Apps + Agent Framework |
| Enterprise, GCP | Cloud Run + Vertex AI |
| Open-source preference | Temporal (self-hosted) + E2B/Daytona + CrewAI |

Quick Reference

Platform Cost at a Glance

| Platform | Minimum Cost | Best For | GPU |
|---|---|---|---|
| Fly.io (suspended) | $0.15/mo | Idle agents | No |
| Hetzner CX23 | $4.35/mo | Always-on CPU | No |
| Cloudflare Workers | $5/mo | Edge routing | Workers AI |
| Render | $7/mo | Simple background workers | No |
| Railway | $5/mo (+usage) | Auto-deploy from GitHub | No |
| AWS Lambda | Free tier | Short API tasks | No |
| GitHub Actions | Free tier | Coding agents | T4 |
| Modal | $30 free credits | GPU sandboxes | H100/A100/T4/L4 |
| Cloud Run | Free tier | Production containers | L4/RTX PRO 6000 |
| Lambda Cloud | $1.29/hr (A100) | Sustained GPU | A100/H100/B200 |
| Temporal Cloud | $200/mo | Durable workflows | N/A |
| E2B | $0.05/hr | Untrusted code sandboxes | No |

Orchestration at a Glance

| Platform | Durability | Best For | Pricing |
|---|---|---|---|
| Temporal | Enterprise-grade | Complex long-running agents | $200-2,000/mo cloud |
| Inngest | Good | Serverless TypeScript | Free tier + usage |
| Trigger.dev | Good | Long-running TS tasks | Free tier + usage |
| LangGraph Platform | Good | LangChain multi-agent | $0.001/node |
| AWS Step Functions | Good | AWS-native workflows | $0.025/1K transitions |
| BullMQ | Basic | Simple Redis queues | Free (OSS) |

Conclusion

The agent deployment landscape has matured into clear categories with clear winners. For ephemeral sandboxes, Modal and E2B dominate. For durable workflows, Temporal is the gold standard with Inngest and Trigger.dev as lighter alternatives for TypeScript teams. For persistent agents, Fly.io's suspend/resume pattern is the most cost-effective approach for bursty workloads, while Hetzner is unbeatable for truly always-on CPU agents.

The most important architectural insight from studying how Devin, Codex, and Cursor deploy: separate your agent's brain, hands, and state. The reasoning engine, the execution environment, and the session state should each be independently replaceable. When a sandbox crashes, the agent should reconnect to a new one and continue from its last checkpoint — not start over. This pattern, pioneered by Anthropic's managed agents, is becoming the industry standard for production agent systems.


Some links in this article are affiliate links. If you purchase through them, I may earn a small commission at no extra cost to you. I only recommend products I actually use.
