Slash Your Cloud Bill by 65%: The AI-Native Infrastructure Lessons from Railway

When Your Cloud Bill Becomes the Product

Most engineering teams don't notice the problem until someone pulls up the monthly AWS or Azure invoice and asks an uncomfortable question: "Why are we spending more on infrastructure than on the engineers building the product?"

Railway, a developer infrastructure platform, found itself in exactly this position. After scaling to serve hundreds of thousands of developers, their cloud costs had grown faster than their revenue. The fix wasn't negotiating better reserved instance pricing or tagging resources more carefully. It was a fundamental rethink of how modern infrastructure should be built - particularly for workloads that are increasingly AI-driven.

The lessons from their journey are directly applicable to any Australian business running AI workloads on public cloud. And the numbers are worth paying attention to: Railway reported cost reductions of around 65% after moving significant portions of their infrastructure to a private cloud model built around AI-native principles.

This article breaks down what they did, why it worked, and how you can apply the same thinking to your own infrastructure.


The Hidden Tax of Legacy Cloud Architecture

Traditional cloud infrastructure was designed around a simple premise: pay for compute by the hour, scale horizontally when you need more, and let the cloud provider handle the hardware. For web applications with predictable traffic patterns, this model works reasonably well.

For AI-native cloud infrastructure, it breaks down quickly.

The core problem is that AI workloads have fundamentally different resource profiles than web workloads:

  • GPU utilisation is bursty and hard to predict. A model training run might saturate 8 x A100s for 6 hours, then sit idle for 12.
  • Memory bandwidth matters more than raw compute. Standard cloud instance families aren't optimised for the memory-intensive operations common in inference workloads.
  • Cold start latency is punishing. Spinning up a container with a 7B parameter model loaded into GPU memory takes time that HTTP-based auto-scaling wasn't designed to handle.
  • Egress costs compound aggressively. AI pipelines move large volumes of data - embeddings, model weights, training datasets - and public cloud providers charge for every byte that leaves their network.

Railway's engineers documented that a significant portion of their cloud spend was effectively overhead: paying for the abstraction layers, managed service margins, and network egress that came with running on hyperscale public cloud. None of that overhead was delivering value to their users.


What "AI-Native" Infrastructure Actually Means

The term gets used loosely, but for practical purposes, AI-native cloud infrastructure has a specific meaning: infrastructure designed from the ground up around the resource and operational patterns of AI workloads, rather than retrofitted from web-era assumptions.

In concrete terms, this means:

Hardware selection is workload-specific. Rather than choosing from a catalogue of general-purpose instance types, AI-native infrastructure starts with the question: what does this workload actually need? A real-time inference endpoint for a small language model has very different requirements from a batch embedding pipeline or a model fine-tuning job.

Scheduling is GPU-aware. Standard Kubernetes schedulers treat GPUs as a simple countable resource. AI-native schedulers understand GPU topology, NVLink interconnects, memory bandwidth constraints, and can make placement decisions that significantly improve utilisation.

Networking is designed for model distribution. Serving large models across multiple GPUs, or running distributed training across nodes, requires high-bandwidth, low-latency interconnects that standard cloud networking doesn't provide without significant additional cost.

Storage is tiered by access pattern. Model weights that are loaded once per deployment don't need the same storage tier as training data that's read sequentially at high throughput. Getting this wrong is a common source of unnecessary cost.

Railway rebuilt their infrastructure around these principles, which is why the cost improvements were structural rather than incremental.


The Private Cloud Calculation: When It Actually Makes Sense

Private cloud isn't right for every workload, and the decision deserves honest analysis rather than ideology. The economics shift in favour of private infrastructure when three conditions are met simultaneously:

  1. Workloads are predictable enough to justify reserved capacity. If you're running GPU inference 24/7, you're paying public cloud spot or on-demand prices for a resource you could own outright.
  2. You have the operational maturity to manage hardware. Private cloud shifts responsibility for hardware failure, network maintenance, and capacity planning onto your team.
  3. Your spend has crossed a threshold where the economics are clear. A rough rule of thumb: if you're spending more than $50,000 AUD per month on GPU compute in public cloud, the private cloud calculation is worth running seriously.
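The break-even maths behind condition 3 can be sketched in a few lines. All figures below are illustrative assumptions, not quotes - substitute your own on-demand rate, hardware pricing, and operating costs:

```python
# Rough break-even sketch for one always-on GPU (illustrative figures only).
CLOUD_HOURLY_AUD = 6.00      # assumed on-demand price for a datacentre-class GPU
HARDWARE_COST_AUD = 45_000   # assumed purchase price of an equivalent GPU server
MONTHLY_OPEX_AUD = 800       # assumed power, space, and maintenance per month

HOURS_PER_MONTH = 730
cloud_monthly = CLOUD_HOURLY_AUD * HOURS_PER_MONTH

# Months of 24/7 use before owned hardware is cheaper than continuing to rent.
breakeven_months = HARDWARE_COST_AUD / (cloud_monthly - MONTHLY_OPEX_AUD)
print(f"Public cloud: ${cloud_monthly:,.0f}/month")
print(f"Break-even after {breakeven_months:.1f} months of 24/7 use")
```

With these assumed numbers the hardware pays for itself in about a year of continuous use - which is exactly why condition 1 (predictable, always-on workloads) has to hold before the calculation favours private infrastructure.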

Railway's approach was hybrid rather than binary. They moved stable, predictable AI workloads to private infrastructure while retaining public cloud for burst capacity and geographic distribution. This is the pattern that makes the most operational sense for most organisations.

The cost reduction came from eliminating the managed service margin (typically 30-40% on top of raw hardware cost), eliminating egress charges between services that now ran on the same physical network, and improving GPU utilisation through better scheduling.


Developer Velocity: The Cost That Doesn't Appear on the Invoice

One of the less-discussed benefits of Railway's infrastructure rebuild was the impact on developer velocity. When infrastructure is expensive to run, teams optimise for cost rather than iteration speed. Developers wait for batch jobs to complete before running experiments. Model evaluation cycles stretch from hours to days. Feedback loops slow down.

This has a real cost, even if it doesn't show up in the cloud invoice.

Consider a scenario common in Australian enterprise AI teams: a data science team is fine-tuning a language model for a document classification use case. On public cloud, each training run costs $800-1,200 AUD in GPU compute. The team runs three or four experiments per week, carefully rationing their budget. The model takes six weeks to reach production quality.

On infrastructure optimised for AI workloads - with better GPU utilisation, no egress costs between training and evaluation infrastructure, and faster storage for dataset loading - the same experiments cost $200-300 each. The team runs experiments daily. The model reaches production quality in two weeks.
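The cadence effect in this scenario is simple arithmetic. The sketch below uses the midpoints of the per-run costs above and an assumed weekly budget of $4,000 AUD (roughly what three to four runs at public cloud prices implies):

```python
# Iteration cadence under a fixed weekly experiment budget.
# Figures are from the illustrative scenario above, not benchmarks.
WEEKLY_BUDGET_AUD = 4_000  # assumed budget

def experiments_per_week(cost_per_run: float, budget: float = WEEKLY_BUDGET_AUD) -> int:
    return int(budget // cost_per_run)

public_cloud = experiments_per_week(1_000)  # midpoint of $800-1,200 per run
optimised = experiments_per_week(250)       # midpoint of $200-300 per run
print(public_cloud, optimised)  # -> 4 and 16 runs per week respectively
```

The same budget buys four times the iteration rate, which is where the two-weeks-versus-six-weeks difference comes from.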

The infrastructure saving is real, but the four weeks of faster time-to-production is often worth more. AI deployment speed is a competitive advantage, and infrastructure that constrains it has a cost that finance teams rarely capture.


Practical Steps to Audit Your Current Infrastructure

Before making any architectural changes, you need an honest picture of where your money is going. Here's a structured approach:

Step 1: Categorise your GPU spend

Total GPU spend
├── Training workloads (how much? how utilised?)
├── Inference endpoints (always-on vs. on-demand)
├── Batch processing pipelines
└── Development and experimentation

Most teams find that development and experimentation GPU spend is 2-3x higher than it needs to be, simply because instances are left running between experiments.
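A minimal sketch of the categorisation step, assuming your billing export carries cost-allocation tags (the tag names and dollar figures here are hypothetical - map them to whatever tagging scheme your provider's billing export uses):

```python
# Bucket tagged GPU line items into the four categories above.
from collections import defaultdict

line_items = [  # hypothetical billing-export rows
    {"tag": "training", "aud": 12_400},
    {"tag": "inference", "aud": 9_800},
    {"tag": "batch", "aud": 3_100},
    {"tag": "dev", "aud": 7_600},
    {"tag": "dev", "aud": 2_200},
]

totals = defaultdict(float)
for item in line_items:
    totals[item["tag"]] += item["aud"]

grand_total = sum(totals.values())
for tag, aud in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{tag:<10} ${aud:>9,.0f}  ({aud / grand_total:.0%})")
```

Even this crude grouping usually makes the development-spend problem visible immediately.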

Step 2: Measure actual GPU utilisation

Don't rely on cloud provider metrics, which measure instance availability rather than GPU utilisation. Deploy NVIDIA DCGM (Data Center GPU Manager) or equivalent tooling and measure actual SM utilisation, memory bandwidth, and tensor core activity over a representative week.

If your average GPU utilisation is below 40%, you have significant room to improve before considering infrastructure changes.
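Once you have samples (for example, SM utilisation exported by a DCGM exporter at a fixed interval), the decision rule is straightforward. The sample values below are made up for illustration:

```python
# Decide whether a GPU fleet is underutilised from sampled SM utilisation.
def mean_utilisation(samples: list[float]) -> float:
    return sum(samples) / len(samples)

def is_underutilised(samples: list[float], threshold: float = 40.0) -> bool:
    # Below ~40% average, improve scheduling before buying more GPUs.
    return mean_utilisation(samples) < threshold

week_of_samples = [12.0, 85.0, 3.0, 64.0, 9.0, 71.0, 5.0]  # % SM utilisation
print(mean_utilisation(week_of_samples), is_underutilised(week_of_samples))
```

The pattern in the sample data - high peaks with long idle troughs - is the bursty profile described earlier, and it drags the average well under the 40% threshold.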

Step 3: Quantify egress costs

Pull your egress line items specifically. Many teams are surprised to find that 15-25% of their total cloud spend is network egress - data moving between services, to end users, or to external APIs. For AI workloads moving large embeddings or model outputs, this can be even higher.
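Computing the egress share from a billing export is a one-liner once the line items are categorised. The category names and figures below are hypothetical - adapt them to your provider's export format:

```python
# Compute egress as a share of total spend from billing line items.
def egress_share(items: list[dict]) -> float:
    total = sum(i["aud"] for i in items)
    egress = sum(i["aud"] for i in items if "egress" in i["category"].lower())
    return egress / total

bill = [  # hypothetical monthly line items
    {"category": "GPU compute", "aud": 52_000},
    {"category": "Internet egress", "aud": 11_000},
    {"category": "Inter-region egress", "aud": 4_000},
    {"category": "Storage", "aud": 8_000},
]
print(f"{egress_share(bill):.0%} of spend is egress")  # -> 20% of spend is egress
```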

Step 4: Map your workload stability

Which of your AI workloads run continuously or on a predictable schedule? These are candidates for reserved capacity or private infrastructure. Which are genuinely bursty or experimental? These belong on public cloud.
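One crude but useful way to sort workloads into those two buckets is to look at week-to-week variation in GPU-hours. The threshold and figures below are illustrative assumptions, not a standard:

```python
# Classify workloads by how stable their weekly GPU-hours are.
def classify(weekly_gpu_hours: list[float], stability_threshold: float = 0.25) -> str:
    mean = sum(weekly_gpu_hours) / len(weekly_gpu_hours)
    spread = (max(weekly_gpu_hours) - min(weekly_gpu_hours)) / mean
    # Low week-to-week variation -> predictable -> reserved-capacity candidate.
    if spread <= stability_threshold:
        return "reserved-capacity candidate"
    return "public cloud burst"

print(classify([160.0, 165.0, 158.0, 162.0]))  # steady inference fleet
print(classify([10.0, 400.0, 0.0, 90.0]))      # sporadic training experiments
```

Anything that lands in the first bucket is worth pricing against reserved or private capacity; the second bucket is where public cloud elasticity actually earns its premium.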


What to Do Next

If your organisation is running meaningful AI workloads on public cloud, the Railway case study is worth taking seriously - not as a template to copy, but as evidence that the standard approach to cloud cost optimisation for AI has significant room for improvement.

Start with the audit steps above. The data will tell you where the leverage is. For most Australian businesses running AI workloads, the highest-impact interventions are:

  • Improving GPU utilisation before spending on more instances. Target 70%+ average utilisation before scaling out.
  • Consolidating egress-heavy services onto the same network, whether that's a single cloud region, a VPC with proper peering, or private infrastructure.
  • Separating training and inference infrastructure so each can be optimised independently rather than compromised to serve both use cases.
  • Implementing proper GPU scheduling if you're running Kubernetes - look at the NVIDIA GPU Operator and time-slicing configurations as a starting point.
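On the last point, the NVIDIA GPU Operator configures time-slicing through a device-plugin ConfigMap. The fragment below is a sketch of one possible shape - the ConfigMap name and the `any` profile key are illustrative, so check the GPU Operator documentation for your version before applying it:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config   # illustrative name
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4   # advertise each physical GPU as 4 schedulable GPUs
```

Time-slicing raises utilisation for small, latency-tolerant workloads by letting several pods share one GPU, but note it provides no memory isolation between those pods - it's a utilisation tool, not a multi-tenancy boundary.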

If your monthly GPU spend is above $30,000 AUD and growing, it's worth having a detailed conversation about whether your current infrastructure architecture is the right fit for where your AI workloads are heading. The gap between well-optimised and poorly-optimised AI-native cloud infrastructure is large enough that it regularly determines whether AI initiatives are financially sustainable - or whether they quietly get defunded when the invoices land.

The 65% cost reduction Railway achieved wasn't magic. It was the result of making infrastructure decisions that matched the actual characteristics of their workloads. The same analysis is available to any team willing to do the work.
