
AI Agent Deployment: Cloud Platforms Compared for Ephemeral, Long-Running, and GPU Workloads (2026)

The AI agent deployment landscape has fractured into three distinct patterns in 2026, each with wildly different cost, latency, and durability tradeoffs. A Fly.io agent sleeping between requests costs $0.15/month. A Modal sandbox with an H100 GPU costs $3.95/hour. An always-on Hetzner VM costs $4/month. Choosing wrong means either burning money or building on infrastructure that can't handle your agent's actual workload.

This guide breaks down 20+ platforms across ephemeral, long-running, and GPU-accelerated deployments — with real pricing, architecture patterns, and how the top AI companies (Devin, Codex, Cursor, Anthropic) actually deploy their agents in production. For context on how AI agents use retrieval and reasoning, see my RAG techniques comparison guide.

> How LLMs Work: Premium Report
Get the 24-page PDF with 5 exclusive sections — Model Comparison Matrix, Parameter Calculation Worksheets, ML Interview Cheat Sheet, Annotated Paper Reading List, and 50+ term Glossary.
[Get the Premium Report — $19]


The Three Deployment Patterns

Every AI agent deployment in 2026 falls into one of three patterns:

| Pattern | How It Works | Latency | Cost Profile | Best For |
|---|---|---|---|---|
| Ephemeral Sandbox | Spin up, execute task, destroy | Cold start + task time | Pay-per-second | Coding agents, untrusted code, one-shot tasks |
| Durable Workflow | Long-running with checkpointing | Seconds to hours | Per-step or per-hour | Multi-step agents, human-in-the-loop |
| Persistent Service | Always on or suspend/resume | Instant (warm) | Monthly or per-second | Chat agents, monitoring, always-available |

The right pattern depends on your agent's lifecycle. A coding agent that processes GitHub issues needs an ephemeral sandbox — spin up a container, clone the repo, execute, output a PR, destroy. A research agent that runs for hours across multiple data sources needs durable workflows with checkpointing. A customer support bot needs to be always available but is idle 95% of the time — suspend/resume is the sweet spot.

Ephemeral and Serverless Platforms

These platforms spin up compute on demand. You pay only for what you use, but cold starts and timeout limits matter.

| Platform | Max Timeout | GPU | GPU Cost/hr | Cold Start | Best For |
|---|---|---|---|---|---|
| Modal | 24 hours | H100, A100, T4, L4 | $0.59-3.95 | <1 sec | AI sandboxes, GPU inference |
| Cloud Run | 60 min | L4, RTX PRO 6000 | ~$0.67 (L4) | <5 sec (GPU) | Production containerized agents |
| GitHub Actions | 6 hours | T4 (16GB) | $3.12-6.12 | 15-45 sec | Coding agents, repo automation |
| AWS Lambda | 15 min | None | N/A | 100ms-10s | Short API orchestration |
| Cloudflare Workers | 5 min CPU | Workers AI | ~$0.011/1K neurons | ~1ms | Edge routing, lightweight agents |
| Vercel Functions | 5 min (Pro) | None | N/A | 250ms-1s | Frontend API layer only |
| Blacksmith | 6 hours | None | N/A | Faster than GH | Cheaper GitHub Actions |
| Azure Container Apps | Configurable | A100, T4 | ~$2.88-3.96 | Variable | Enterprise Azure workloads |

Modal dominates AI agent sandboxed execution in 2026 ($87M Series B, $1.1B valuation). Sandboxes run on gVisor, autoscale to 50,000+ concurrent sessions, and start in under 1 second — with GPU access inside sandboxes.

The tradeoffs: the SDK is Python-only (TypeScript agents need a wrapper), and sandboxes bill at non-preemptible rates, roughly 3.75x the advertised base rates. At 10K CPU-only runs/month (5-min average), expect ~$50-100. With T4 GPUs, that jumps to ~$500.
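
Those estimates are easy to sanity-check with a per-second cost model. A sketch; the CPU rate below is a hypothetical effective rate chosen for illustration, while the GPU figure uses the $0.59/hr T4 price listed above:

```python
def monthly_sandbox_cost(runs_per_month: int, avg_minutes: float,
                         rate_per_second: float) -> float:
    """Estimated monthly spend for pay-per-second ephemeral sandboxes."""
    billable_seconds = runs_per_month * avg_minutes * 60
    return billable_seconds * rate_per_second

# Hypothetical effective CPU rate of $0.000025/sec (~$0.09/hr):
cpu = monthly_sandbox_cost(10_000, 5, 0.000025)      # $75/month
# T4 at the listed $0.59/hr:
gpu = monthly_sandbox_cost(10_000, 5, 0.59 / 3600)   # ≈ $492/month
```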

GitHub Actions — The Default Coding Agent Runtime

AI agent PRs on GitHub surged from 4M (Sep 2025) to 17M+ (Mar 2026). GitHub's new Agentic Workflows let you define agent workflows in plain Markdown with native repo access via the GitHub MCP Server. The 6-hour timeout per job is generous for most coding tasks.

Blacksmith is a drop-in replacement with 2x faster hardware (gaming-grade CPUs) at 50-67% lower cost and Docker layer caching. For coding agents with heavy dependency installs, Blacksmith cuts environment setup from minutes to seconds.

AWS Lambda — Limited but Cheap

The 15-minute hard timeout and zero GPU support make Lambda unsuitable for complex agents. It works for short API orchestration — an agent that receives a webhook, makes a few LLM API calls, and stores the result. SnapStart (2026) brings cold starts down to 90-140ms for Python. ARM64/Graviton2 cuts costs 20%.
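
That short-orchestration shape is worth making concrete. A minimal sketch with the LLM client and result store injected as plain callables, so the handler stays testable; all names here are illustrative, not a specific SDK:

```python
import json
from typing import Callable

def make_handler(call_llm: Callable[[str], str],
                 store: Callable[[str, str], None]):
    """Build a Lambda-style handler: webhook in, one LLM call, result stored."""
    def handler(event, context=None):
        payload = json.loads(event["body"])           # webhook body
        summary = call_llm(f"Summarize: {payload['text']}")
        store(payload["id"], summary)                 # e.g. DynamoDB/S3 in prod
        return {"statusCode": 200, "body": json.dumps({"id": payload["id"]})}
    return handler
```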

Cloudflare Workers — Cheapest at Scale

At $5/month for 10M requests, Workers is unbeatable for lightweight agent routing. The 128MB memory limit is severely restrictive for AI workloads, but Durable Objects provide uniquely powerful stateful edge compute for agent state. Best used as the routing and orchestration layer that dispatches heavy work elsewhere.

Long-Running and Persistent Platforms

For agents that need to stay alive — either always-on or waking on demand.

CPU-Only Persistent Agent Comparison (2 vCPU, 4GB RAM, 24/7)

| Platform | Monthly Cost | Idle Cost | Auto-Stop | Complexity |
|---|---|---|---|---|
| Hetzner CX23 | $4.35 | N/A (always on) | Manual | Low |
| Fly.io | ~$6.64 | ~$0.15 (suspended) | Suspend/Resume | Low |
| Render Starter | $7-25 | N/A (always on) | None | Low |
| DigitalOcean | $12 | Billed (powered off) | Manual | Low |
| Railway | ~$20 | Sleep available | Dev only | Low |
| GCP e2-medium | ~$25 | $0 (suspended) | Via Scheduler | Medium |
| Azure ACI | ~$32 | $0 (stopped) | Via Functions | Medium |
| AWS Fargate | ~$36 | $0 (stopped) | Via Lambda | High |

Fly.io — The Idle Cost Killer

Fly.io's suspend/resume is the killer feature for agents that are bursty. It saves complete memory state to disk and resumes faster than a cold boot. A suspended machine costs only storage (~$0.15/GB/month) — meaning an agent that's active 5% of the time costs $0.15-0.50/month instead of $6-32/month always-on.

The Machines API gives you programmatic lifecycle control: POST /machines/{id}/suspend and POST /machines/{id}/start. Auto-stop/auto-start triggers on incoming requests, so your agent wakes automatically when needed.
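
In Python, that lifecycle control is a couple of authenticated POSTs. A sketch assuming the public Machines API base URL; the endpoint paths follow the pattern quoted above:

```python
import urllib.request

FLY_API = "https://api.machines.dev/v1"   # assumption: current Machines API base

def machine_url(app: str, machine_id: str, action: str) -> str:
    return f"{FLY_API}/apps/{app}/machines/{machine_id}/{action}"

def machine_action(app: str, machine_id: str, action: str, token: str) -> int:
    """POST a lifecycle action ('suspend', 'start', 'stop') to a Fly Machine."""
    req = urllib.request.Request(
        machine_url(app, machine_id, action),
        method="POST",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

Suspending on idle and letting auto-start wake the machine on traffic is what turns an always-on bill into a storage-only bill.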

Note: Fly.io GPUs are being deprecated after August 2026. CPU-only going forward.

Hetzner — Unbeatable for Always-On

CX23 (2 vCPU, 4GB, EU-only) at $4.35/month with 20TB transfer included. No GPU, no auto-scaling, no suspend/resume — but for a simple always-on agent that doesn't need idle management, nothing comes close on price.

Railway and Render — Deploy and Forget

Both offer GitHub-connected auto-deploys with minimal configuration. Railway charges $20/vCPU/mo + $10/GB/mo (usage-based). Render has background workers from $7/mo with auto-scaling on paid plans. Neither supports GPU or has meaningful idle cost savings. Choose these when simplicity matters more than optimization.

> Railway
Deploy in Minutes
Railway deploys from GitHub with zero configuration — databases, cron jobs, and autoscaling included. No Dockerfiles required.
[Try Railway Free]

GPU Cloud Pricing Compared

GPU pricing varies enormously. For sustained inference, Lambda Cloud is cheapest. For bursty workloads, serverless options (Modal, Cloud Run) with scale-to-zero are more cost-effective.

A100 80GB On-Demand (Per Hour)

| Provider | On-Demand | Spot | Reserved |
|---|---|---|---|
| Spheron | $1.07 | $0.60 | — |
| Lambda Cloud | $1.29 | — | ~$0.90 |
| RunPod | $2.17 | — | — |
| CoreWeave | $2.21 | — | ~$1.00 |
| DigitalOcean | $2.99 | $1.99 | — |
| AWS (p4d) | ~$3.43 | ~$3.07 | ~$2.00 |
| GCP | ~$3.67 | ~$1.10 | ~$2.10 |
| Azure | ~$5.78+ | — | — |

H100 SXM On-Demand (Per Hour)

| Provider | On-Demand | Spot | Reserved |
|---|---|---|---|
| Lambda Cloud | $2.49 | — | ~$1.89 |
| Spheron | $2.50 | $1.03 | — |
| RunPod | $2.69 | — | — |
| DigitalOcean | $2.99 | $1.99 | — |
| CoreWeave | $4.76+ | — | ~$2.00 |
| AWS (p5) | ~$6.88 | ~$3.83 | — |
| GCP | ~$8.70 | ~$2.61 | — |

Lambda Cloud at $1.29/hr for an A100 80GB is roughly 2.7x cheaper than AWS on-demand. For sustained 24/7 inference, that's $929/month vs ~$2,470/month. Against those cloud rates, self-hosting a consumer GPU can break even within a few months:

| GPU | VRAM | Price | Break-Even vs Cloud | Best For |
|---|---|---|---|---|
| RTX 5090 | 32 GB GDDR7 | ~$1,999 | 2-3 months | 32B+ models, long context |
| RTX 4090 | 24 GB GDDR6X | ~$1,000 (used) | 1-2 months | Best value used |
| RTX 5080 | 16 GB GDDR7 | ~$999 | 1-2 months | Sub-27B models |
| RTX 3090 | 24 GB GDDR6X | ~$700 (used) | <1 month | Budget 24GB option |
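
The break-even math behind that table is one division. A sketch that ignores electricity, resale value, and the performance gap between consumer cards and data-center GPUs:

```python
def break_even_months(gpu_price: float, cloud_rate_per_hour: float,
                      hours_per_month: float = 730) -> float:
    """Months of 24/7 cloud rental equal to the one-time GPU purchase price."""
    return gpu_price / (cloud_rate_per_hour * hours_per_month)

# RTX 5090 (~$1,999) vs Lambda Cloud A100 at $1.29/hr, running 24/7:
months = break_even_months(1999, 1.29)   # ≈ 2.1 months
```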

Workflow Orchestration and Durability

For agents that run multi-step workflows spanning minutes to days, you need durable execution — automatic checkpointing, retry logic, and crash recovery.

| Platform | Durability | Multi-Agent | LLM Native | TypeScript | Self-Host | Best For |
|---|---|---|---|---|---|---|
| Temporal | Excellent | Good | Good | Yes | Yes (OSS) | Enterprise agents |
| Inngest | Good | Good | Excellent | Excellent | Yes | Serverless/event-driven |
| Trigger.dev | Good | Fair | Good | Excellent | Yes | Long-running TS tasks |
| LangGraph Platform | Good | Excellent | Excellent | Fair | Yes | LangChain teams |
| AWS Step Functions | Good | Fair | Fair | Fair | No | AWS-native teams |
| Lambda Durable (2026) | Excellent | Fair | Good | Yes | No | AWS serverless agents |
| BullMQ | Basic | None | None | Excellent | Yes | Simple job queues |
| CrewAI | Basic | Excellent | Excellent | No | Yes | Multi-agent prototypes |

Temporal — The Gold Standard

Temporal is used by Netflix, Stripe, and Snap for mission-critical workflows. For AI agents, every LLM call becomes a Temporal Activity with automatic retry and state persistence. The OpenAI Agents SDK integration (GA March 2026) makes this seamless.

The key advantage: on failure recovery, Temporal replays workflow code against event history without re-executing LLM calls — saving token costs. The downside is a steep learning curve (weeks to learn properly) and significant operational overhead if self-hosted ($26K-41K/month TCO with SREs). Temporal Cloud ($200-2,000/month) eliminates that burden.

Critical gotcha: The default retry policy causes retry storms on 429 rate limits. You must classify LLM API errors explicitly and use exponential backoff with jitter.
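
A hedged sketch of that fix; in Temporal itself you would express the limits through the activity's retry policy, but the error triage and full-jitter backoff look like this regardless of orchestrator:

```python
import random

RETRYABLE_STATUS = {429, 500, 502, 503}   # retry rate limits and server errors,
                                          # never 400/401/404 (they won't heal)

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng=random.random) -> float:
    """Full-jitter exponential backoff: uniform in [0, min(cap, base * 2^attempt)]."""
    return rng() * min(cap, base * (2 ** attempt))

def should_retry(status: int, attempt: int, max_attempts: int = 6) -> bool:
    return status in RETRYABLE_STATUS and attempt < max_attempts
```

The jitter is what prevents a fleet of workers from retrying in lockstep and re-triggering the same 429.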

Inngest — Best for Serverless TypeScript

Inngest is the natural fit for Vercel/Next.js teams. Its AgentKit SDK provides multi-agent orchestration with shared Network State, and step.ai.wrap() turns any OpenAI/Anthropic call into a durable step with automatic checkpointing. The useAgent React hook streams real-time updates from durable workflows directly to the browser.

Self-hostable as a single service with Postgres or SQLite backend.

Trigger.dev — Best for Long-Running TypeScript

Where Inngest uses serverless endpoints, Trigger.dev runs on dedicated compute — better for heavy AI workloads. No timeout limits, so tasks can run for hours. The Realtime bridge streams LLM responses to your frontend during background execution. First-class Next.js integration and Apache 2.0 licensed.

AWS Lambda Durable Functions (New in 2026)

The biggest AWS development for agents: imperative durable execution directly inside Lambda handlers with transparent checkpointing. Maps "almost perfectly" onto AI agent orchestration patterns — iterative tool calls, conversation context across turns, human-in-the-loop. No new infrastructure needed, but full vendor lock-in.

The Consensus Pattern

The durable execution pattern for multi-step agents in 2026:

1. Each LLM call = Activity/Step (automatically checkpointed)
2. Each tool execution = Activity/Step (automatically retried)
3. Conversation state = Workflow state (persisted externally)
4. Recovery = Replay from last checkpoint (not from scratch)

This saves token costs on failure recovery and provides complete audit trails.
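
The replay step of this pattern fits in a few lines. A toy sketch of the idea, not any specific SDK's API:

```python
class DurableRun:
    """Toy durable execution: each step's result is checkpointed to an event
    log; on replay, completed steps return recorded results without re-running."""

    def __init__(self, event_log=None):
        self.event_log = event_log if event_log is not None else []
        self._cursor = 0

    def step(self, name, fn):
        if self._cursor < len(self.event_log):            # replaying history
            recorded_name, result = self.event_log[self._cursor]
            assert recorded_name == name, "workflow code must be deterministic"
            self._cursor += 1
            return result
        result = fn()                                     # first execution
        self.event_log.append((name, result))
        self._cursor += 1
        return result
```

Re-running a crashed workflow with the saved `event_log` skips the LLM calls you already paid for, which is exactly the token saving described above.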

How Top AI Companies Deploy Agents

The most useful signal comes from looking at how companies running agents at scale actually architect their systems.

| Company | Infrastructure | Sandboxing | Pricing |
|---|---|---|---|
| Devin | AWS (multi-tenant or dedicated VPC) | Isolated "Devbox" per task | ~$8-9/hr active work |
| OpenAI Codex | OpenAI cloud containers | Fresh container, offline by default | 25-50 CU/task |
| Cursor | Cloud workers, outbound HTTPS | Isolated VMs, self-hosted option | $20/mo Pro |
| GitHub Copilot | GitHub Actions runners | Ephemeral Actions environment | $10-39/mo |
| Anthropic Claude Code | Local CLI + Managed Agents | Brain/Hands/Session separation | API tokens + $0.08/session-hr |
| Replit Agent | GCP (single-tenant enterprise) | Snapshot-based deployment | $25-100/mo |

The Brain-Hands-Session Pattern (Anthropic)

The most sophisticated architecture. Anthropic separates the LLM reasoning ("brain") from code execution ("hands") and state ("session"). Each component can fail independently and be replaced without losing progress. The session is a durable event log — on crash, the agent resumes from the last recorded event rather than starting over.

This pattern is worth studying even if you're not using Claude. The key insight: don't couple your agent's reasoning with its execution environment. If the sandbox crashes, the brain should be able to reconnect to a new sandbox and continue.
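
A minimal sketch of that decoupling; the sandbox class here is a stand-in for whatever executor you actually use (E2B, Modal, a local VM):

```python
class FakeSandbox:
    """Stand-in for the 'hands'; a real one would wrap E2B, Modal, etc."""
    def __init__(self, name):
        self.name = name
    def run(self, command):
        return f"{self.name} ran {command}"

class AgentBrain:
    """Reasoning loop that owns only the durable session log, never the sandbox."""
    def __init__(self, session_log):
        self.session = session_log   # durable event log ("session")
        self.sandbox = None          # replaceable executor ("hands")

    def attach(self, sandbox):
        self.sandbox = sandbox       # swap in a fresh sandbox after a crash

    def act(self, command):
        output = self.sandbox.run(command)
        self.session.append((command, output))
        return output
```

Because `AgentBrain` never stores sandbox state, replacing a dead sandbox with `attach()` loses nothing: the session log still holds every prior step.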

The Harness Pattern (Cursor)

Cursor's architecture uses a control loop ("harness") that manages AI inference and dispatches tool calls to workers. Workers use outbound-only HTTPS to connect to Cursor's cloud — no inbound ports, VPNs, or firewall changes needed. This makes self-hosted deployment dramatically simpler.

Their self-hosted option (March 2026) deploys via Helm chart + Kubernetes operator with a fleet management API for autoscaling. 35% of Cursor's own merged PRs are written by their autonomous cloud agents.

The Ephemeral Sandbox Pattern (Codex, E2B)

OpenAI Codex runs a fresh container per task with a two-phase runtime: setup phase (network enabled, install dependencies) and agent phase (network disabled by default unless explicitly enabled). This offline-by-default approach is the strongest security posture — the agent literally cannot exfiltrate data.

E2B provides the sandbox infrastructure used by many of these companies. Firecracker microVMs boot in under 200ms, and E2B scaled from 40K to 15M sandboxes/month in a single year. 88% of Fortune 100 have signed up.

Sandboxing and Security

If your agent executes untrusted code, the isolation method matters enormously.

| Method | Startup | Security Boundary | Use Case |
|---|---|---|---|
| Docker containers | ~500ms | Shared kernel (weak) | Trusted code only |
| gVisor | ~200ms | User-space kernel intercept | Medium trust |
| Firecracker microVM | ~125ms (28ms snapshots) | Hardware virtualization | Untrusted code |
| Kata Containers | ~300ms | Full VM boundary | Enterprise/compliance |

Docker containers share the host kernel, so a single kernel vulnerability allows container escape. MicroVM escape requires a hypervisor CVE, exploits rare enough to command $250K-$500K bounties. For production agents executing arbitrary code, Firecracker (E2B) or gVisor (Modal) is the minimum acceptable isolation.

The industry has converged: E2B uses Firecracker, Modal uses gVisor, OpenAI Codex uses isolated containers with network disabled, Cursor uses isolated VMs. Docker alone is only acceptable for trusted, internally-developed agent code.

Agent Protocol Standards (2026)

Three protocols are competing to standardize agent communication:

  • MCP (Model Context Protocol) — Anthropic, now under Linux Foundation. AI-to-tools integration
  • A2A (Agent2Agent) — Google Cloud. Peer-to-peer agent communication
  • ACP (Agent Communication Protocol) — IBM BeeAI. Simple HTTP agent messaging

MCP has the most adoption. A2A is gaining traction for multi-agent systems.

Cost Optimization Strategies

LLM API calls are 50-70% of a typical agent system's costs. Compute is 15-30%. Optimize LLM spend first.

Strategy 1: Suspend/Resume (Fly.io)

Agents that are active 5% of the time cost $0.15-0.50/month suspended vs $6-32/month always-on. The agent suspends on idle, resumes in seconds on incoming requests.

Strategy 2: Spot + Checkpointing

AWS/GCP Spot instances give 60-90% GPU savings. Checkpoint state every 5-10 minutes to S3/GCS. Auto-restart on interruption. Spotify saved 70% ($8.2M to $2.4M/year) with this pattern.
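
A sketch of the checkpointing half of this strategy; `storage` is any dict-like object, so in production you would swap in a thin S3/GCS wrapper (the boto3/GCS calls are omitted here):

```python
import pickle
import time

class Checkpointer:
    """Persist agent state at most every `interval` seconds, so a spot
    interruption loses at most that much work."""

    def __init__(self, storage, key, interval=300.0, clock=time.monotonic):
        self.storage = storage       # dict-like; swap in an S3/GCS wrapper
        self.key = key
        self.interval = interval
        self.clock = clock
        self._last = None

    def maybe_save(self, state) -> bool:
        now = self.clock()
        if self._last is None or now - self._last >= self.interval:
            self.storage[self.key] = pickle.dumps(state)
            self._last = now
            return True
        return False                  # too soon since last checkpoint

    def restore(self, default=None):
        blob = self.storage.get(self.key)
        return pickle.loads(blob) if blob is not None else default
```

On restart after an interruption, the agent calls `restore()` and resumes from the last checkpoint instead of from scratch.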

Strategy 3: Tiered Compute

Split the agent brain (cheap CPU, Hetzner ~$4/month) from GPU execution (serverless Modal/RunPod, on-demand only). The control plane is always on. GPU scales to zero when not processing.

Strategy 4: Model Routing

Frontier models for complex reasoning, mid-tier for standard tasks. One team reduced costs from $187/month to $78/month — a 58% reduction — just by routing simple queries to cheaper models.
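
Routing can start as a crude heuristic and still capture most of the savings. A sketch; the model names and complexity markers below are placeholders, not recommendations:

```python
def route_model(prompt: str, tools_needed: bool) -> str:
    """Naive complexity routing: short, tool-free prompts go to the cheap tier."""
    complex_markers = ("analyze", "plan", "refactor", "prove")
    if (tools_needed
            or len(prompt) > 2000
            or any(m in prompt.lower() for m in complex_markers)):
        return "frontier-model"
    return "mid-tier-model"
```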

Strategy 5: Reserved + On-Demand Mix

Reserve baseline GPU capacity (30-50% discount), burst with on-demand or spot for peaks. Lambda Cloud 1-year reserved H100: $1.89/hr vs $2.49/hr on-demand.

Decision Framework

Architecture Decision Tree

Does the agent need to run untrusted code?
  Yes → E2B (Firecracker) or Modal (gVisor)
  No → Does it need GPU?
    Yes, sustained → Lambda Cloud / CoreWeave
    Yes, bursty → Modal / Cloud Run / RunPod
    No → How long does it run?
      <15 min → AWS Lambda or Cloud Run
      15 min - 6 hr → GitHub Actions or Modal
      Hours/days → Temporal + Fly.io/Railway
      Always on → Hetzner ($4) or Fly.io ($2-7)
      Always on, mostly idle → Fly.io suspend/resume ($0.15/mo)
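
The tree above can be encoded directly, which makes the platform choice reviewable and testable in code. A sketch under the same categories:

```python
def recommend(untrusted_code: bool, needs_gpu: bool, sustained_gpu: bool = False,
              runtime: str = "short", mostly_idle: bool = False) -> str:
    """Encode the decision tree. `runtime` is one of: 'short' (<15 min),
    'medium' (15 min - 6 hr), 'long' (hours/days), 'always'."""
    if untrusted_code:
        return "E2B (Firecracker) or Modal (gVisor)"
    if needs_gpu:
        return "Lambda Cloud / CoreWeave" if sustained_gpu else "Modal / Cloud Run / RunPod"
    if runtime == "short":
        return "AWS Lambda or Cloud Run"
    if runtime == "medium":
        return "GitHub Actions or Modal"
    if runtime == "long":
        return "Temporal + Fly.io/Railway"
    return "Fly.io suspend/resume" if mostly_idle else "Hetzner or Fly.io"
```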

By Agent Type

| Agent Type | Recommended Platform | Monthly Cost |
|---|---|---|
| Chat/QA bot (always available) | Fly.io suspend/resume | $2-7 |
| Coding agent (triggered by issues) | GitHub Actions + Blacksmith | $40-400 |
| Research agent (multi-step, hours) | Temporal + Fly.io | $25-200 |
| Data processing (batch) | Modal or Cloud Run | $50-500 |
| Image/video agent (GPU) | Modal (bursty) or Lambda Cloud (sustained) | $50-929 |
| Enterprise multi-agent | AWS ECS + Temporal + Modal | $200-5,000 |

By Team Profile

| Profile | Stack |
|---|---|
| Solo dev, TypeScript | Inngest or Trigger.dev + Fly.io |
| Startup, Python | Modal + Temporal Cloud |
| Enterprise, AWS | ECS/Fargate + Lambda Durable Functions |
| Enterprise, Azure | Container Apps + Agent Framework |
| Enterprise, GCP | Cloud Run + Vertex AI |
| Open-source preference | Temporal (self-hosted) + E2B/Daytona + CrewAI |

Quick Reference

Platform Cost at a Glance

| Platform | Minimum Cost | Best For | GPU |
|---|---|---|---|
| Fly.io (suspended) | $0.15/mo | Idle agents | No |
| Hetzner CX23 | $4.35/mo | Always-on CPU | No |
| Cloudflare Workers | $5/mo | Edge routing | Workers AI |
| Render | $7/mo | Simple background workers | No |
| Railway | $5/mo (+usage) | Auto-deploy from GitHub | No |
| AWS Lambda | Free tier | Short API tasks | No |
| GitHub Actions | Free tier | Coding agents | T4 |
| Modal | $30 free credits | GPU sandboxes | H100/A100/T4/L4 |
| Cloud Run | Free tier | Production containers | L4/RTX PRO 6000 |
| Lambda Cloud | $1.29/hr (A100) | Sustained GPU | A100/H100/B200 |
| Temporal Cloud | $200/mo | Durable workflows | N/A |
| E2B | $0.05/hr | Untrusted code sandboxes | No |

Orchestration at a Glance

| Platform | Durability | Best For | Pricing |
|---|---|---|---|
| Temporal | Enterprise-grade | Complex long-running agents | $200-2,000/mo cloud |
| Inngest | Good | Serverless TypeScript | Free tier + usage |
| Trigger.dev | Good | Long-running TS tasks | Free tier + usage |
| LangGraph Platform | Good | LangChain multi-agent | $0.001/node |
| AWS Step Functions | Good | AWS-native workflows | $0.025/1K transitions |
| BullMQ | Basic | Simple Redis queues | Free (OSS) |

Conclusion

The agent deployment landscape has matured into clear categories with clear winners. For ephemeral sandboxes, Modal and E2B dominate. For durable workflows, Temporal is the gold standard with Inngest and Trigger.dev as lighter alternatives for TypeScript teams. For persistent agents, Fly.io's suspend/resume pattern is the most cost-effective approach for bursty workloads, while Hetzner is unbeatable for truly always-on CPU agents.

The most important architectural insight from studying how Devin, Codex, and Cursor deploy: separate your agent's brain, hands, and state. The reasoning engine, the execution environment, and the session state should each be independently replaceable. When a sandbox crashes, the agent should reconnect to a new one and continue from its last checkpoint — not start over. This pattern, pioneered by Anthropic's managed agents, is becoming the industry standard for production agent systems.


Some links in this article are affiliate links. If you purchase through them, I may earn a small commission at no extra cost to you. I only recommend products I actually use.
