The Infrastructure Decision That Breaks Most AI Projects
You've got a model that works in development. Your data science team is happy with the results. Now you need to run it at scale - and suddenly the cost projections look nothing like what anyone expected.
This is where most AI projects hit a wall. Not because the model is wrong, but because the infrastructure underneath it wasn't chosen with the actual workload in mind. Scaling AI workloads is fundamentally an engineering problem, and it requires the same rigour you'd apply to any production system. Pick the wrong hardware, and you're either paying three times what you should, or watching your inference latency make the whole thing unusable.
This article walks through the practical hardware decisions - GPU versus TPU, on-premises versus cloud, and the configuration choices that sit between them - so you can make informed calls before you're locked into something expensive.
Understanding What Your Workload Actually Needs
Before you compare hardware specs, you need to characterise your workload. This sounds obvious, but it's skipped constantly.
The two primary workload types are training and inference, and they have very different hardware requirements.
Training is compute-intensive and memory-hungry. You're running large matrix operations repeatedly across your entire dataset, often for hours or days. You need high memory bandwidth, large GPU memory (VRAM), and fast interconnects between devices if you're distributing the job.
Inference is about latency and throughput. You're running a forward pass on individual requests or small batches, often in real time. Here, raw compute power matters less than how quickly you can load the model and return a result.
A third category - fine-tuning - sits between the two. You're not training from scratch, but you're still doing backward passes and gradient updates. It's more memory-intensive than pure inference but far less demanding than full pre-training.
Once you know which category (or combination) you're dealing with, hardware selection becomes much more tractable.
GPUs: The Workhorse of Most Production Deployments
NVIDIA's GPU ecosystem dominates AI infrastructure for practical reasons. The CUDA software stack is mature, the tooling is well-supported, and most frameworks - PyTorch, TensorFlow, JAX - have deep GPU integration.
For most Australian organisations doing training or fine-tuning at moderate scale, an NVIDIA A100 or H100 cluster is the default starting point. The H100 offers roughly three times the training performance of the A100 for transformer-based models, largely due to its Transformer Engine and support for FP8 precision. At the time of writing, H100s are still constrained in availability, which affects both cloud pricing and lead times for on-premises procurement.
Key GPU specs to evaluate:
- VRAM capacity - The A100 comes in 40GB and 80GB variants. The 80GB version is necessary for fine-tuning models above roughly 13 billion parameters without aggressive quantisation.
- Memory bandwidth - The H100 SXM5 delivers around 3.35 TB/s. For inference throughput, memory bandwidth often matters more than raw FLOPS.
- NVLink interconnect - Critical when you're distributing a single model across multiple GPUs. Without fast interconnects, you spend more time moving data than computing.
- TDP and power requirements - An H100 SXM5 draws 700W. At scale, power and cooling become real infrastructure constraints, not just footnotes.
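The VRAM guidance above comes down to back-of-envelope arithmetic. Here's a minimal sketch; the bytes-per-parameter figures are standard rules of thumb (FP16 weights for inference; FP16 weights and gradients plus FP32 master weights and Adam moments for full fine-tuning), and activation memory - which varies with batch size and sequence length - is deliberately excluded.

```python
# Rough VRAM estimates by workload type. Rule-of-thumb bytes per parameter;
# activation memory is workload-dependent and excluded here.
BYTES_PER_PARAM = {
    "inference_fp16": 2,        # FP16 weights only
    "inference_int8": 1,        # INT8 quantised weights
    "full_finetune_adam": 16,   # FP16 weights/grads + FP32 master weights + Adam moments
}

def estimate_vram_gb(params_billions: float, workload: str) -> float:
    """Return an approximate VRAM requirement in GB for the given workload."""
    bytes_needed = params_billions * 1e9 * BYTES_PER_PARAM[workload]
    return bytes_needed / 1e9  # decimal GB

# A 13B model fits an 80GB card comfortably for FP16 inference,
# but full fine-tuning with Adam far exceeds a single GPU.
print(round(estimate_vram_gb(13, "inference_fp16")))      # 26
print(round(estimate_vram_gb(13, "full_finetune_adam")))  # 208
```

This is why full fine-tuning above roughly 13B parameters pushes you towards the 80GB variant plus sharding or quantised adapters, even though inference on the same model is comfortable on a single card.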
For inference specifically, NVIDIA's L4 and L40S GPUs offer a better cost-to-performance ratio than the A100 or H100 for many use cases. They're designed for inference workloads and consume significantly less power.
TPUs: When Google's Hardware Makes Sense
Google's Tensor Processing Units are purpose-built for large-scale machine learning, and they perform extremely well on the workloads they were designed for. But they come with real constraints that limit their applicability outside specific scenarios.
TPUs are available through Google Cloud and are optimised for JAX and TensorFlow. PyTorch support exists via PyTorch/XLA, but it's not as seamless. If your team is already working in PyTorch - which is the case for most organisations - migrating to TPU-native workflows has a non-trivial engineering cost.
Where TPUs genuinely shine is in large-scale pre-training of transformer models. Google used TPU v4 pods to train PaLM (540 billion parameters), and the economics work when you're running jobs at that scale continuously. A TPU v4 pod delivers around 1.1 exaflops of peak compute across 4,096 chips, with a high-bandwidth interconnect mesh that makes data-parallel training efficient.
The practical limitations:
- TPUs use a static compilation model. Your model graphs need to be compiled before execution, which makes dynamic shapes and control flow more complicated.
- Debugging is harder. The tooling is less mature than NVIDIA's ecosystem.
- You're tied to Google Cloud. There's no on-premises TPU option for most organisations.
- Cost efficiency depends heavily on utilisation. TPU pods are reserved capacity - if your workload is bursty, you're paying for idle hardware.
A reasonable rule of thumb: if you're doing large-scale training exclusively in JAX or TensorFlow, and you're running jobs continuously with high utilisation, TPUs are worth serious evaluation. For everything else, GPUs offer more flexibility.
Cloud Versus On-Premises: A Decision Framework
This is the question every organisation faces when scaling AI workloads, and the answer depends on your usage pattern, not your ideology.
Cloud infrastructure makes sense when:
- Your workload is bursty or unpredictable
- You're in early stages and don't know your steady-state requirements yet
- You need access to the latest hardware (H100s, TPU v5) before it's available for purchase
- Your team lacks the operational capacity to manage physical hardware
The major cloud providers - AWS, Google Cloud, Azure - all offer GPU instances. AWS P4d instances use A100s. Google Cloud offers both A100 and H100 options, plus TPU access. Azure's NC-series covers the A100 range. Spot and preemptible instances can reduce training costs by 60-80% if your jobs can handle interruption, which most training jobs can with proper checkpointing.
On-premises infrastructure makes sense when:
- You have consistent, high-utilisation workloads running most of the day
- Data sovereignty or security requirements constrain what you can put in a public cloud
- You've done the three-year TCO analysis and the numbers favour owned hardware
- You have the engineering team to manage it
A concrete example: a financial services client running daily batch inference across 50 million records found that after 18 months, their cloud GPU spend had exceeded the cost of purchasing equivalent on-premises A100 hardware. They moved to a hybrid model - on-premises for predictable batch workloads, cloud burst capacity for training runs and peak demand.
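The crossover that client hit can be sketched as a simple cost model. All the dollar figures below are hypothetical placeholders - substitute real quotes from your provider and your own power, cooling, and staffing estimates.

```python
# Sketch of a multi-month cloud vs on-premises cost comparison.
# All dollar figures are hypothetical placeholders - use real quotes.

def cloud_cost(hourly_rate: float, hours_per_day: float, months: int) -> float:
    """Total cloud GPU spend at a given daily utilisation."""
    return hourly_rate * hours_per_day * 30 * months

def onprem_cost(hardware: float, monthly_power_cooling: float,
                monthly_staffing: float, months: int) -> float:
    """Upfront hardware plus ongoing power, cooling, and staffing."""
    return hardware + (monthly_power_cooling + monthly_staffing) * months

# Example: one multi-GPU node at high utilisation, over 36 months.
cloud = cloud_cost(hourly_rate=32.0, hours_per_day=20, months=36)
onprem = onprem_cost(hardware=250_000, monthly_power_cooling=1_500,
                     monthly_staffing=4_000, months=36)
print(f"cloud: ${cloud:,.0f}, on-prem: ${onprem:,.0f}")
```

With these placeholder numbers, high utilisation flips the answer towards owned hardware well before 36 months; drop hours_per_day to 4 and cloud wins decisively. The point is that the decision is a function of utilisation, not a fixed rule.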
The hybrid approach is increasingly common and often the most pragmatic answer for mature AI deployments.
Optimising Before You Scale
A mistake that compounds quickly: scaling infrastructure before optimising the workload itself. More hardware doesn't fix inefficient code - it just makes inefficient code more expensive.
Before you add more GPUs or move to a larger instance type, work through these optimisation steps:
Model-level optimisations:
- Quantisation - Reducing precision from FP32 to FP16 or INT8 can halve memory requirements with minimal accuracy loss for most inference tasks. Tools like NVIDIA TensorRT and Hugging Face's bitsandbytes library make this accessible.
- Pruning and distillation - For inference at scale, a distilled model that's 10% of the original's size and 90% as accurate may be the right trade-off.
- Batching - Dynamic batching groups multiple inference requests together. For many models, going from batch size 1 to batch size 16 increases throughput by 8-10x with minimal latency increase.
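To make the quantisation trade-off concrete, here's symmetric per-tensor INT8 quantisation sketched in pure Python. This is a conceptual illustration only - production code would use TensorRT or bitsandbytes, which quantise per-channel and handle calibration properly.

```python
# Symmetric per-tensor INT8 quantisation, sketched to show the
# memory/accuracy trade-off. Not production code - use TensorRT or
# bitsandbytes for real workloads.

def quantise_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map FP32 weights to INT8 using a single scale factor."""
    scale = max(abs(w) for w in weights) / 127  # 127 = max signed INT8 magnitude
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantise(q: list[int], scale: float) -> list[float]:
    return [x * scale for x in q]

weights = [0.417, -1.27, 0.083, 0.912, -0.336]
q, scale = quantise_int8(weights)
restored = dequantise(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

# Each weight now needs 1 byte instead of 4, at the cost of a
# rounding error bounded by roughly scale / 2 per weight.
print(q, max_err)
```

The 4x memory reduction (FP32 to INT8) comes with an error bounded by half the quantisation step - which is why accuracy loss is minimal for most inference tasks, and why models with large outlier weights need the per-channel schemes the production tools provide.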
Infrastructure-level optimisations:
- Profile your GPU utilisation before adding capacity. If your GPUs are sitting at 40% utilisation, you have a scheduling or batching problem, not a capacity problem.
- Use mixed precision training (FP16 with FP32 master weights) to reduce memory usage and speed up training on modern GPUs.
- Analyse your data loading pipeline. In many training setups, the GPU is waiting on CPU-bound data preprocessing. Fixing this can improve training throughput more than upgrading the GPU.
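The data-loading fix usually means overlapping CPU preprocessing with GPU compute. Here's a minimal pure-Python sketch of the idea using a bounded prefetch queue; real pipelines use framework loaders (for example, PyTorch's DataLoader with num_workers) rather than hand-rolled threads.

```python
# Sketch of prefetching: a background thread prepares upcoming batches while
# the consumer (standing in for the GPU) works on the current one, so compute
# never waits on CPU-bound preprocessing.
import queue
import threading

def preprocess(raw):
    """Stand-in for CPU-bound preprocessing (decode, augment, tokenise)."""
    return [x * 2 for x in raw]

def prefetcher(batches, depth: int = 2) -> queue.Queue:
    """Start a background thread that preprocesses ahead of the consumer."""
    q = queue.Queue(maxsize=depth)  # bounded: at most `depth` batches ahead
    def worker():
        for b in batches:
            q.put(preprocess(b))
        q.put(None)  # sentinel: no more data
    threading.Thread(target=worker, daemon=True).start()
    return q

q = prefetcher([[1, 2], [3, 4], [5, 6]])
results = []
while (batch := q.get()) is not None:
    results.append(batch)  # stand-in for the GPU forward/backward pass
print(results)  # [[2, 4], [6, 8], [10, 12]]
```

The bounded queue is the important detail: it caps memory use while keeping a couple of batches ready, which is exactly what framework loaders do under the hood.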
Multi-GPU and Distributed Training Considerations
When a single GPU isn't enough, you need a distributed training strategy. There are three main approaches, and choosing the wrong one wastes both time and compute.
Data parallelism splits your dataset across multiple GPUs, each holding a full copy of the model. This works well when your model fits in a single GPU's memory. PyTorch's DistributedDataParallel (DDP) is the standard implementation.
Model parallelism splits the model itself across GPUs. This is necessary when the model is too large for a single device - for example, a 70-billion parameter model won't fit in a single 80GB A100. Tensor parallelism (splitting individual layers across devices) and pipeline parallelism (splitting the model depth-wise across devices) are the two variants.
ZeRO optimisation (from Microsoft's DeepSpeed library) is a hybrid approach that shards model states - parameters, gradients, and optimiser states - across GPUs. ZeRO Stage 3 can reduce per-GPU memory requirements by a factor of 64 with 64 GPUs. It's become a standard tool for large model training.
For most organisations scaling AI workloads beyond a single GPU for the first time, starting with DDP for data parallelism and adding DeepSpeed ZeRO when you hit memory limits is a sensible progression.
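The core mechanic of data parallelism - each replica computes gradients on its own shard, then an all-reduce averages them so every replica applies an identical update - can be illustrated without any GPUs at all. The sketch below simulates two "GPUs" fitting y = weight * x by gradient descent; real code would use PyTorch's DistributedDataParallel over NCCL rather than plain Python lists.

```python
# Conceptual sketch of data parallelism: each "GPU" computes gradients on its
# shard of the batch, then an all-reduce averages them so every replica
# applies the same update. Real training uses DDP over NCCL.

def local_gradient(shard, weight):
    """Gradient of mean squared error for y = weight * x on one shard."""
    return sum(2 * (weight * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(grads):
    """Average gradients across replicas (what an all-reduce computes)."""
    return sum(grads) / len(grads)

# Targets follow y = 3x, so descent should pull the weight towards 3.
data = [(1, 3), (2, 6), (3, 9), (4, 12)]
shards = [data[:2], data[2:]]  # one shard per simulated "GPU"

weight = 0.0
for _ in range(100):
    grads = [local_gradient(s, weight) for s in shards]
    weight -= 0.05 * all_reduce_mean(grads)

print(round(weight, 2))  # 3.0
```

Because the averaged gradient is identical on every replica, all copies of the model stay in sync without ever exchanging parameters - only gradients cross the interconnect, which is why NVLink bandwidth matters so much at this step.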
What to Do Next
If you're planning or currently managing AI infrastructure, here's a practical starting point:
- Characterise your workload - Separate training, fine-tuning, and inference requirements. Each may warrant different hardware.
- Profile before you provision - If you have existing GPU workloads, measure actual utilisation before assuming you need more capacity.
- Run a cost model over 36 months - Cloud versus on-premises decisions look very different at 12 months versus 36 months. Include power, cooling, and staffing in the on-premises calculation.
- Test quantisation on your inference workloads - For most models, INT8 quantisation with TensorRT or bitsandbytes is a low-risk way to reduce hardware requirements.
- Talk to your cloud provider's technical team - AWS, Google, and Azure all have specialist ML infrastructure teams. They can run workload assessments and often surface options that aren't obvious from the pricing pages.
If you'd like to work through the infrastructure decisions for your specific AI workloads, get in touch with the team at Exponential Tech. We work with Australian organisations to build AI infrastructure that's sized correctly from the start - not retrofitted after the costs blow out.