Machine Learning Systems: Architecture and Optimization (DR)
1. Inference Engine Foundations
Modern ML inference engines range from general-purpose runtimes to specialized compilers. ONNX Runtime is a flexible engine that loads models in the framework-neutral ONNX format and can partition the computation graph across multiple execution providers (EPs) like CPU, GPU, TensorRT, etc. It performs graph optimizations (constant folding, operator fusion) and then delegates subgraphs to the fastest available EP, with a default CPU execution provider as a fallback for any ops not supported on accelerators.
TensorRT, by contrast, acts as an ahead-of-time (AOT) ML compiler for NVIDIA GPUs – it parses a model (often via ONNX), applies aggressive optimizations (layer fusion, precision reduction to FP16/INT8, etc.), and generates a highly optimized engine that runs on GPU hardware. TensorRT uses tactics like automatic kernel selection and autotuning to choose the fastest CUDA kernels for each layer, and it can considerably reduce inference latency by compiling models down to a streamlined set of GPU operations.
Core ML (Apple's on-device engine) is built into iOS/macOS and similarly optimizes models for Apple's Neural Engine and GPU. At runtime, Core ML dynamically partitions the neural network graph across the CPU, GPU, and 16-core Neural Engine to maximize throughput, using FP16 math on the ANE/GPU and FP32 on the CPU. This heterogeneous execution is analogous to ONNX Runtime's EP mechanism, ensuring each layer runs on the best-suited hardware, with fallback to CPU for full coverage.
PyTorch JIT (TorchScript) embeds a just-in-time compiler within PyTorch. It can trace or script models into an IR, perform optimizations (e.g. fusing elementwise ops), and either execute that IR in a custom interpreter or hand off subgraphs to native libraries. For example, PyTorch with TorchScript can hand off parts of the model to TensorRT using Torch-TensorRT; in this flow, subgraphs that TensorRT supports are compiled into a TRT engine, while the remainder runs on PyTorch's default runtime. This highlights how inference engines often interface with compilers: the runtime identifies segments of the computation that a compiler backend can handle and offloads them (e.g. via a Compile() API in ONNX Runtime's EP plugin model).
TensorFlow Serving, on the other hand, is a high-level serving system for TensorFlow models. It manages loading SavedModel graphs and uses TensorFlow's runtime (with optional XLA JIT compilation) to execute them. TF Serving is designed for production environments – it supports versioned model deployment, batched request handling, and can leverage TensorFlow's optimized kernels for CPUs and GPUs out of the box.
In summary, ONNX Runtime and TF Serving act as general runtimes that can work with multiple models and hardware targets, whereas TensorRT and Core ML are more specialized code-generation engines for maximal performance on specific platforms, and PyTorch JIT lies somewhere in between (integrating compilation into a deep learning framework).
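To make the EP mechanism concrete, here is a minimal sketch using ONNX Runtime's Python API with a prioritized provider list; the model path is a placeholder, and which providers are actually available depends on the installed build:

```python
# Minimal sketch of ONNX Runtime's execution-provider (EP) mechanism.
# "model.onnx" is a placeholder path; available EPs depend on the installed build.
import onnxruntime as ort

preferred = [
    "TensorrtExecutionProvider",   # try the TensorRT EP first, if present
    "CUDAExecutionProvider",       # then plain CUDA
    "CPUExecutionProvider",        # guaranteed fallback for any unsupported ops
]
providers = [p for p in preferred if p in ort.get_available_providers()]

session = ort.InferenceSession("model.onnx", providers=providers)
print(session.get_providers())     # the EPs the runtime actually assigned

# The graph is partitioned: subgraphs a higher-priority EP supports run there, and the
# rest fall back to the CPU provider, so the model always runs end to end.
# outputs = session.run(None, {"input": input_array})  # input name depends on the model
```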
Inference Engines vs. ML Compilers: The line between runtime and compiler is increasingly blurred. Inference engines often include compilation steps (e.g. graph optimization or codegen) while compilers need runtime support (for memory management, etc.). Many engines use an IR (intermediate representation) as a contract between a compiler and the runtime. For instance, XLA (Accelerated Linear Algebra compiler) can compile a TensorFlow graph into optimized device-specific code (AOT), which the TF runtime then loads as an executable. ONNX Runtime uses its IR (the ONNX graph) and allows plugin compilers (EPs) to consume parts of it. The choice of execution model can be ahead-of-time vs. just-in-time: TensorRT generally performs one-time AOT compilation of the model to a GPU engine before inference, yielding very low per-inference overhead. PyTorch's JIT can do lazy compilation at runtime – it might incur overhead on the first few inferences but can specialize the code based on actual data shapes or allow mixing compiled and non-compiled ops. Ahead-of-time compilation tends to yield faster steady-state throughput (since the entire model is fused into optimized code), whereas JIT compilation offers flexibility (e.g. dynamic control flow or shape polymorphism) at the cost of some runtime overhead. In practice, systems often combine both: e.g. TensorFlow will use Grappler (graph optimizer) and maybe XLA to pre-compile static parts of the graph, while leaving truly dynamic portions to be executed by the runtime interpreter. The compiler-runtime interface typically involves the compiler producing either native machine code or an optimized graph of "kernels" that the runtime can execute. ONNX Runtime's EP mechanism essentially treats a compiled subgraph as a custom fused operator with its own kernel – after partitioning, the runtime just invokes that compiled kernel for the subgraph. This modular design allows swapping different compilers under the hood without changing the high-level API.
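As a small illustration of the JIT path, the following sketch (with an arbitrary toy model) traces a PyTorch module into TorchScript IR, which can then be frozen and fused before execution; an AOT flow such as TensorRT or XLA would instead compile the model offline:

```python
# Illustrative sketch: tracing a PyTorch model into TorchScript IR (the JIT path).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).eval()
example = torch.randn(1, 128)

traced = torch.jit.trace(model, example)   # records ops for this input into an IR graph
traced = torch.jit.freeze(traced)          # folds weights/attributes, enabling more fusion
print(traced.graph)                        # the IR handed to the runtime/fusers

# The first calls may be slower while the JIT specializes and fuses; steady-state calls
# execute the optimized IR.
out = traced(example)
```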
Operator Libraries, Kernel Selection, and Fallbacks: Inference engines rely heavily on optimized operator libraries for each hardware type. NVIDIA GPUs use cuDNN, cuBLAS, TensorRT's kernel library, etc., while CPUs may use oneDNN (MKL-DNN) or custom vectorized implementations. The engine must choose the right kernel for each operation considering device, data type (FP32/FP16/INT8), and tensor shapes. Many frameworks use autotuning: for example, TensorRT will benchmark or heuristically select the fastest convolution algorithm for the given shape and target hardware, and TensorFlow does this with cuDNN (trying several convolution algorithms on first run). Kernel selection can thus be static (via predefined heuristics or operator implementations) or dynamic (via profiling). Some engines maintain multiple implementations: e.g. an op could have a GPU kernel, an XLA compiled variant, and a fallback CPU kernel. If the preferred kernel is unavailable (or if the model uses an unsupported custom op), the engine falls back. ONNX Runtime explicitly guarantees a CPU implementation for every ONNX operator as a fallback. That means if, say, the TensorRT EP cannot handle a particular layer, ONNX Runtime will execute that layer on CPU so that the model still runs correctly (albeit slower). PyTorch JIT similarly falls back to the Python interpreter for any ops that are not TorchScript-compatible – the model continues working with a mix of compiled and uncompiled ops. This layered fallback approach ensures functional completeness: the inference can always proceed, with optimized kernels accelerating the parts they can. For extensibility, engines allow registering custom kernels (e.g. user-defined ops in ONNX Runtime or TensorFlow) so that new hardware or new operations can be integrated. In summary, inference systems are a co-design of compilation (graph-level transforms, code generation) and runtime (efficient execution of kernels, memory management, threading). A well-designed engine will maximize usage of high-performance kernels and compilers, but gracefully degrade to slower paths when needed, so that developers get both performance and portability.
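The layered fallback logic can be pictured with a toy dispatcher (purely illustrative Python, not the API of any particular engine): kernels are registered per (op, dtype, device), the preferred device is tried first, and a reference CPU implementation is the guaranteed last resort.

```python
# Toy illustration of layered kernel selection with fallback (not a real framework API).
KERNELS = {
    ("Conv", "fp16", "gpu"): lambda x: f"cudnn_conv_fp16({x})",
    ("Conv", "fp32", "cpu"): lambda x: f"onednn_conv_fp32({x})",
    ("MatMul", "fp32", "cpu"): lambda x: f"blas_gemm_fp32({x})",
}

def dispatch(op, dtype, preferred_devices=("gpu", "cpu")):
    for device in preferred_devices:
        kernel = KERNELS.get((op, dtype, device))
        if kernel is not None:
            return kernel
    # Last resort: a reference CPU implementation is assumed to exist for every op
    return KERNELS[(op, "fp32", "cpu")]

print(dispatch("Conv", "fp16")("x"))   # hits the GPU kernel
print(dispatch("Conv", "fp32")("x"))   # no fp32 GPU kernel registered -> CPU fallback
```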
2. Memory and Compute Optimization Techniques
To deploy and train large models efficiently, a variety of techniques target reduced memory usage and computational load:
Model Compression (Quantization, Pruning, Distillation): These methods produce smaller, faster models for inference. Quantization reduces the precision of weights/activations (e.g. 32-bit float to 8-bit integers), cutting memory by 4× and often accelerating compute by using integer arithmetic. Modern CPUs and GPUs have optimized INT8 instructions, so quantized models can see up to a 4× speed and memory improvement over FP32, and ~2× over FP16, with minimal accuracy loss if properly calibrated. Pruning removes unnecessary weight connections – for example, unstructured weight pruning can eliminate 80–90% of parameters in over-parameterized networks while preserving accuracy. This sparsity yields a smaller model (memory and storage footprint), though actual speedup on general hardware depends on sparse kernel support. Structured pruning (removing entire neurons or channels) tends to yield more actual speed gains than random sparsity, because it maps better to hardware. Research has shown ResNet-50 can be pruned to 90% sparsity with only ~1% accuracy drop, demonstrating how much redundancy many models have. Finally, knowledge distillation trains a smaller student model to replicate the outputs of a large teacher model. Distillation often produces a model that is much smaller and faster while retaining most of the accuracy of the big model. For instance, DistilBERT (a distilled version of BERT) has 40% fewer parameters and is 60% faster than BERT, yet achieves about 97% of BERT's performance on language tasks. Such distilled models are easier to deploy on edge devices or under strict latency constraints. These compression techniques can be combined (e.g. one might prune then quantize a model) to multiplicatively reduce memory footprint and inference cost. They trade a bit of model capacity for substantial efficiency gains – as evidenced by TensorRT's integrated model optimizer, which supports quantization, pruning, and distillation to compress models for faster inference.
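As a rough sketch of how two of these techniques look in practice, the snippet below prunes and then dynamically quantizes a toy PyTorch model (the layer choices and sparsity level are illustrative; production flows would typically use calibration-based static quantization or a vendor toolkit):

```python
# Sketch: unstructured magnitude pruning followed by post-training dynamic INT8 quantization.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).eval()

# Zero out 80% of the smallest-magnitude weights in the first layer
prune.l1_unstructured(model[0], name="weight", amount=0.8)
prune.remove(model[0], "weight")                 # bake the sparsity into the weight tensor
sparsity = (model[0].weight == 0).float().mean().item()
print(f"layer sparsity: {sparsity:.0%}")         # ~80%; real speedup still needs sparse kernels

# Dynamic INT8 quantization of the (now pruned) Linear layers: weights stored as int8
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
print(quantized[0])                              # prints a dynamically quantized Linear module
```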
Memory-Efficient Training Strategies: During training, especially of large models, GPU memory is often a bottleneck. Activation checkpointing (a.k.a. gradient checkpointing or recomputation) addresses this by not storing all intermediate activations; instead, certain layers' outputs are recomputed on the fly during backpropagation to save memory. This trades extra compute for reduced memory usage – for example, one can checkpoint only a subset of layers such that memory grows sub-linearly with network depth. In practice, activation recomputation can cut memory usage by ~50% or more at the cost of a modest increase in computation. This can be the difference between fitting a model on a GPU or running out of memory (BERT Large, for instance, reportedly cannot even train without checkpointing on 16 GB GPUs). Gradient accumulation is another technique: instead of a single large batch that might overflow memory, the optimizer accumulates gradients over multiple smaller mini-batches and applies the update after, say, 8 iterations, effectively simulating a batch 8× larger. This allows training with large effective batch sizes even when device memory is limited, at the cost of more forward/backward passes per optimizer update (and thus somewhat lower throughput than running a true large batch). Mixed precision training leverages lower precision for most operations. Using FP16 (half-precision) or BF16 reduces memory footprint and speeds up tensor operations on supported hardware (NVIDIA Tensor Cores, TPUs, etc.), while dynamic loss scaling (for FP16) or BF16's wider exponent range maintains numerical stability. Mixed precision can dramatically accelerate training – many models see 1.5–2× speedups – with no degradation in final accuracy. It also cuts memory usage roughly in half for activations and gradients, enabling larger batch sizes or models to fit in VRAM. Frameworks like PyTorch AMP and TensorFlow's Automatic Mixed Precision handle casting and scaling, so users get the benefit with minimal code changes. In summary, these training-time optimizations (checkpointing, gradient accumulation, mixed precision) mitigate memory limits and improve compute throughput, allowing bigger models to be trained on given hardware.
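A minimal sketch combining the three techniques in PyTorch might look as follows (the model, sizes, and hyperparameters are illustrative):

```python
# Sketch: activation checkpointing + gradient accumulation + mixed precision (PyTorch).
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

model = nn.Sequential(*[nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()) for _ in range(8)]).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()      # dynamic loss scaling for FP16
accum_steps = 8                           # effective batch = 8 x micro-batch

def forward_with_checkpointing(x):
    # Recompute each block's activations during backward instead of storing them
    for block in model:
        x = checkpoint(block, x, use_reentrant=False)
    return x

for step in range(100):
    x = torch.randn(16, 1024, device="cuda")
    with torch.cuda.amp.autocast():       # mixed-precision region
        loss = forward_with_checkpointing(x).pow(2).mean() / accum_steps
    scaler.scale(loss).backward()         # gradients accumulate across micro-batches
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
```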
ZeRO Optimizer (Memory Sharding): For distributed training of enormous models, the ZeRO (Zero Redundancy Optimizer) approach divides memory-heavy data structures across data-parallel processes. In traditional data parallelism, each GPU holds a full copy of the model parameters, gradients, and optimizer states, leading to a lot of memory duplication across workers. ZeRO eliminates this redundancy in stages: Stage 1 partitions optimizer states (e.g., Adam's moment vectors) across the GPUs, so each GPU only stores a slice of the optimizer memory instead of all of it. Stage 2 extends this by also sharding the gradients among GPUs – after each backward pass, gradients are reduced such that each GPU ends up with only the portion of gradients needed for its parameter shard. Stage 3 (the full ZeRO-3) even partitions the model parameters themselves, so no GPU holds the entire model weights – layers are distributed and gathered on the fly for forward/backward computations. With ZeRO-3, per-GPU memory for model states shrinks roughly in proportion to the data-parallel degree: the N GPUs collectively hold a single partitioned copy of the parameters, gradients, and optimizer states, rather than each GPU holding its own full copy. This enables training models with trillions of parameters by leveraging the aggregate memory of a GPU cluster. Microsoft's DeepSpeed, which implements ZeRO, demonstrates that these stages allow scaling model size roughly linearly with the number of GPUs, albeit with some communication overhead to coordinate the shards. The offloading extensions (ZeRO-Offload and ZeRO-Infinity) even allow paging shards to CPU or NVMe storage for out-of-core training. The result is that models that would normally require hundreds of GB of GPU memory can be trained with a fraction of that per GPU, by carefully orchestrating who holds which piece of data at any given time.
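In DeepSpeed, ZeRO is enabled through the engine's configuration; the sketch below shows the general shape of such a setup under assumed settings (the tiny model and hyperparameters are placeholders, and the script would be launched with DeepSpeed's distributed launcher):

```python
# Sketch of enabling ZeRO with DeepSpeed; config keys follow DeepSpeed's documented schema,
# but the model and hyperparameters are illustrative placeholders.
import torch
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "bf16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 3,                              # 1: optimizer states, 2: +gradients, 3: +parameters
        "offload_optimizer": {"device": "cpu"},  # optional ZeRO-Offload-style paging to CPU
    },
}

model = torch.nn.Linear(4096, 4096)              # stand-in for a large transformer
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

for _ in range(10):
    x = torch.randn(4, 4096, device="cuda", dtype=torch.bfloat16)
    loss = engine(x).pow(2).mean()
    engine.backward(loss)                        # reduce-scatters gradients to their shards
    engine.step()                                # each rank updates only its optimizer shard
```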
Transformer-Specific Optimizations (KV Cache): In autoregressive transformer inference (e.g. GPT-style text generation), a key-value cache of attention states is kept to avoid recomputing attention over past tokens at each new step. This speeds up generation since each new token's attention need only compare the new query against stored key/value vectors for previous tokens. However, the KV cache can consume a lot of memory, especially for long sequences and large models (per request it grows with number of layers × sequence length × hidden size). Optimizing this cache is a focus of systems like vLLM. The vLLM engine uses a novel PagedAttention mechanism to manage the KV cache more efficiently, akin to virtual memory paging. It avoids fragmentation and reuses memory so well that, whereas existing systems waste 60–80% of KV cache memory to fragmentation and over-reservation, vLLM wastes less than 4%. This allows serving many more concurrent text generation requests on the same GPU memory budget by dynamically paging parts of the cache in and out. Other techniques include quantizing the KV cache – since cached keys and values are written once and then only read during attention, they can be stored in lower precision (even 8-bit) to cut memory usage further. There are reports of quantizing the KV cache to FP8 with negligible quality loss, roughly doubling the context length that can be cached for a given memory size. Additionally, frameworks handle the KV cache as part of the model state in stateless environments (we'll touch on this in serverless inference). Effective KV cache management keeps per-token work down to a single attention pass over the cached context, rather than re-encoding the entire prefix at every step, which is crucial for latency in large language model deployment.
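Conceptually, a KV cache is just per-layer storage of past keys and values that grows by one position per generated token. The toy sketch below (shapes and dtype are illustrative) shows the basic append-and-reuse pattern; systems like vLLM back this with paged, block-allocated memory instead of contiguous tensors:

```python
# Toy sketch of a per-layer KV cache for autoregressive decoding.
import torch

class LayerKVCache:
    """Keeps keys/values for one attention layer so past tokens are not re-encoded."""
    def __init__(self):
        self.k = None  # (batch, heads, seq_len_so_far, head_dim)
        self.v = None

    def update(self, k_new, v_new):
        # k_new, v_new: (batch, heads, 1, head_dim) for the newly generated token
        if self.k is None:
            self.k, self.v = k_new, v_new
        else:
            self.k = torch.cat([self.k, k_new], dim=2)   # append along the sequence axis
            self.v = torch.cat([self.v, v_new], dim=2)
        return self.k, self.v                            # full history for attention

cache = LayerKVCache()
for _ in range(5):                                       # five decoding steps
    k_new = torch.randn(1, 4, 1, 8, dtype=torch.float16)
    v_new = torch.randn(1, 4, 1, 8, dtype=torch.float16)
    keys, values = cache.update(k_new, v_new)

print(keys.shape)  # torch.Size([1, 4, 5, 8]); each step only computes K/V for the new token
```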
Parallelism Strategies: To accelerate training and inference on distributed systems, one can exploit different forms of parallelism:
- Data Parallelism: each worker (GPU) has a full replica of the model and processes a different subset of the input data. After each training step, gradients are all-reduced across workers so that they stay in sync (for inference, data parallel just means each node handles different requests). Data parallelism is straightforward and scales well until communication costs (for synchronizing gradients) become a bottleneck.
- Model Parallelism: partition the model itself across multiple devices. Tensor (intra-layer) model parallelism splits the computations within a layer across devices – e.g. for a transformer with a large fully-connected layer, two GPUs can each hold part of the weight matrix and compute a portion of the matmul, combining results at the end. NVIDIA's Megatron-LM uses this approach to split very large matrix multiplications across GPUs. It requires fast interconnect (GPUs must exchange partial results) and careful mapping of operations so that each GPU stays busy. In Megatron, 8-way tensor parallelism allowed scaling to multi-billion-parameter models fairly efficiently. However, as model size grows, pure tensor parallelism hits diminishing returns (too much communication for too little compute per GPU).
- Pipeline (inter-layer) parallelism: segments the model's layers into stages and assigns each stage to a different device. For example, in a 4-stage pipeline, device 1 holds layers 1-5, device 2 holds layers 6-10, etc. During training/inference, different micro-batches of data are fed in a staggered fashion so that all pipeline stages work in parallel on different data. Pipeline parallelism reduces memory per device (each only holds a slice of the model) but introduces pipeline fill/flush overhead and increases latency for a single sample. Schemes like GPipe and PipeDream explored efficient scheduling of micro-batches to maximize throughput. NVIDIA's implementation combines pipeline parallelism across server nodes with tensor parallelism within each node to balance the load. By overlapping communication with computation and using enough micro-batches, pipeline parallelism can approach good efficiency, although it adds complexity in handling updates across stages (and introduces the concept of pipeline bubble – e.g., the first batch has to pass through all stages before full throughput is reached).
In practice, hybrid parallelism (combining the above) is used for ultra-large models. For example, a large transformer may use 8-way tensor parallelism for each layer and 4 pipeline stages, across e.g. 32 GPUs, plus data parallel across several such groups. This 3D parallelism was key to training trillion-parameter models like those in NVIDIA's Selene supercomputer experiment. They reported 114× throughput improvement when scaling from a 1 billion-param model on 32 GPUs to a 1 trillion-param model on 3072 GPUs by using 8-way tensor + 8-way pipeline parallelism (and data parallel groups). The massive scale (the latest MLPerf Training included a submission with over 10,000 accelerators in one job) shows that with the right parallel strategy, near-linear scaling is achievable. For inference serving, model parallelism (including pipeline) is mainly used to serve models that don't fit on one device or to reduce per-request latency by utilizing multiple devices per request. Data parallelism is more commonly used in inference for throughput (cloning the model to serve many queries concurrently). Overall, memory and compute optimizations – from compression to distributed parallelism – are what enable today's extremely large models to be trained and served in practice, by squeezing as much effective performance out of the hardware as possible.
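For the most common building block, data parallelism, a minimal PyTorch DistributedDataParallel sketch looks like the following (launched with torchrun, one process per GPU; the model and data are placeholders). Tensor and pipeline parallelism instead partition the model itself and are usually handled by libraries such as Megatron-LM or DeepSpeed rather than hand-written code:

```python
# Minimal data-parallel training sketch with PyTorch DDP (run via: torchrun --nproc_per_node=N script.py)
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")                    # torchrun provides rank/world-size env vars
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda()         # full replica on every GPU
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    for _ in range(10):
        x = torch.randn(32, 1024, device="cuda")       # each rank draws a different shard of data
        loss = model(x).pow(2).mean()
        loss.backward()                                # DDP all-reduces gradients here
        opt.step()
        opt.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```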
3. Inference Pipelining and Scheduling
Inference serving must handle many requests efficiently under latency constraints. Batching is a fundamental technique: processing multiple inputs together often yields higher throughput due to better hardware utilization (especially on GPUs, which are throughput-oriented). There are different batching strategies:
- Static batching: requests are accumulated into a batch of a fixed size (or until a fixed interval passes). For example, always wait to have 32 images then run a single batch inference. This maximizes GPU utilization but can introduce delay when traffic is low (a request might wait in the batch queue).
- Dynamic batching: the system dynamically groups incoming requests into variable-size batches in real-time, up to some limit, and dispatches them as soon as possible. For instance, NVIDIA Triton Inference Server will fill a batch up to a max batch size or a timeout – whichever comes first – so that under low load you get smaller batches (lower latency) and under high load you get larger batches (higher efficiency). This approach adapts to traffic and typically yields a good throughput/latency trade-off.
- Adaptive/continuous batching: an advanced form of dynamic batching where batch assembly is continuous and fine-grained. In the context of LLM serving, continuous batching schedules work at the granularity of individual decoding steps – as soon as one sequence in the batch finishes generating, a waiting request can take its slot at the next forward pass, rather than the whole batch having to finish together. This was demonstrated in the vLLM system with continuous batching, which can greatly improve GPU utilization for language generation by always keeping the batch full of tokens from whatever requests are available. Adaptive batching algorithms may also adjust batch size on the fly based on measured latency (to meet an SLA). In summary, static batching is simplest but rigid, while dynamic/continuous batching can improve throughput without sacrificing latency too much by intelligently grouping queries (a minimal dynamic-batching loop is sketched after this list).
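The dynamic-batching policy can be captured in a few lines (this is a toy illustration, not any particular server's API): dispatch whenever the batch is full or a short timeout expires, whichever comes first.

```python
# Toy dynamic-batching loop: dispatch when the batch is full OR the wait window expires.
import queue
import threading
import time
from concurrent.futures import Future

MAX_BATCH = 8
MAX_WAIT_S = 0.005                              # 5 ms batching window

request_queue: "queue.Queue[dict]" = queue.Queue()

def fake_model(inputs):                         # stand-in for one fused forward pass over the batch
    return [x * 2 for x in inputs]

def batching_loop():
    while True:
        batch = [request_queue.get()]           # block until at least one request arrives
        deadline = time.monotonic() + MAX_WAIT_S
        while len(batch) < MAX_BATCH:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(request_queue.get(timeout=remaining))
            except queue.Empty:
                break
        outputs = fake_model([r["input"] for r in batch])
        for req, out in zip(batch, outputs):
            req["future"].set_result(out)       # hand each caller its own result

threading.Thread(target=batching_loop, daemon=True).start()

futures = []
for i in range(20):                             # simulate a burst of concurrent requests
    f = Future()
    request_queue.put({"input": i, "future": f})
    futures.append(f)
print([f.result() for f in futures])            # batch sizes adapt to how requests arrive
```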
Execution Scheduling: Beyond batching, servers must schedule how multiple models and requests run on limited resources. A simple strategy is greedy scheduling – execute whichever requests are available whenever the device is free. More complex scenarios (multiple models, multi-GPU pipelines, etc.) can be formulated as scheduling problems, sometimes as a DAG (directed acyclic graph) of tasks. For example, inference for a request might involve a sequence of CPU preprocessing, GPU inference, then CPU postprocessing – scheduling these with limited thread and GPU resources can be non-trivial. Algorithms like HEFT (Heterogeneous Earliest Finish Time) can be used to schedule DAGs on heterogeneous CPUs/GPUs by considering expected execution times on each. Academic proposals even use ILP (Integer Linear Programming) to optimally assign and order operations to minimize latency or maximize throughput. For instance, the Pesto system formulated the DNN partition+schedule problem as an ILP and achieved optimal placement/scheduling that outperformed heuristics, reducing training time by ~31% vs prior state-of-the-art. In inference serving, an ILP solver might decide how to allocate requests to a pool of devices to meet latency constraints – though ILP is often too slow for real-time decisions, so it's mainly a design-time tool. Practical schedulers might use simpler rules or learned policies. The key considerations are avoiding idle hardware (keep GPUs busy) and avoiding queuing delays for high-priority requests. Many serving systems implement priority queues or service-level objectives – e.g., a high-priority request can preempt batch building. Some research has looked at optimal scheduling of batch inference jobs to meet latency targets, which can be NP-hard, requiring approximations.
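To give a flavor of such scheduling heuristics, the toy sketch below greedily assigns a chain of tasks to whichever device finishes each one earliest, given assumed per-device runtime estimates; real HEFT additionally ranks tasks by upward rank and models data-transfer costs along DAG edges, both omitted here:

```python
# Toy earliest-finish-time scheduler over a linear task chain (runtimes are assumed values).
tasks = ["preprocess", "backbone", "head", "postprocess"]          # already in dependency order
est_runtime = {                                                    # seconds per device, illustrative
    "preprocess":  {"cpu": 0.002, "gpu": 0.004},
    "backbone":    {"cpu": 0.250, "gpu": 0.010},
    "head":        {"cpu": 0.030, "gpu": 0.003},
    "postprocess": {"cpu": 0.001, "gpu": 0.002},
}

device_free_at = {"cpu": 0.0, "gpu": 0.0}
prev_finish = 0.0
schedule = []
for task in tasks:
    # Pick the device with the earliest finish time, accounting for when it frees up
    # and when the predecessor task completes.
    best = min(
        est_runtime[task],
        key=lambda d: max(device_free_at[d], prev_finish) + est_runtime[task][d],
    )
    start = max(device_free_at[best], prev_finish)
    finish = start + est_runtime[task][best]
    device_free_at[best] = finish
    prev_finish = finish
    schedule.append((task, best, round(start, 4), round(finish, 4)))

print(schedule)  # e.g. preprocess->cpu, backbone->gpu, head->gpu, postprocess->cpu
```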
Overlapping Computation and Communication: High-performance distributed training taught us to overlap communication with computation; the same applies to distributed inference or even single-device scenarios with I/O. In multi-GPU inference pipelines, one can overlap data transfers (e.g. copying input to the GPU) with GPU computation of the previous inference. Using separate threads/streams, a server might be feeding the next batch to a GPU (or sending data over the network to another node) while the current batch is still processing. This pipelining reduces bubbles. For example, if model partitioning is used (half the layers on GPU0, half on GPU1), then GPU0 can start processing request N+1 while GPU1 is finishing request N – so long as GPU1 receives the intermediate outputs in time. Overlapping is crucial for good device utilization. In a multi-node pipeline, this becomes a producer-consumer scenario; algorithms like PipeDream's "wait-free backprop" essentially overlap forward passes, backward passes, and parameter updates across pipeline stages. In inference the picture is simpler: use asynchronous calls so that as soon as a GPU finishes, it is immediately given new work (possibly already prepared while it was busy). Communication overlap also applies to networking – e.g. streaming results out as soon as they are ready rather than waiting for the entire batch to finish.
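A minimal PyTorch sketch of this copy/compute overlap uses pinned host memory, a dedicated copy stream, and non-blocking transfers (the model and batch source are placeholders):

```python
# Sketch: overlap host-to-device copies of batch N+1 with GPU compute on batch N.
import torch

model = torch.nn.Linear(4096, 4096).cuda().eval()
copy_stream = torch.cuda.Stream()

def batches(n):
    for _ in range(n):
        yield torch.randn(64, 4096).pin_memory()        # pinned memory enables async copies

prev_gpu_batch = None
with torch.no_grad():
    for host_batch in batches(100):
        with torch.cuda.stream(copy_stream):
            next_gpu_batch = host_batch.to("cuda", non_blocking=True)   # copy on side stream
        if prev_gpu_batch is not None:
            out = model(prev_gpu_batch)                  # compute on default stream overlaps the copy
        torch.cuda.current_stream().wait_stream(copy_stream)  # don't touch the new batch too early
        prev_gpu_batch = next_gpu_batch
```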
Model Partitioning and Placement on Heterogeneous Hardware: In some deployments, different hardware accelerators coexist (CPU, GPU, FPGA, NPU). Deciding which parts of a model run on which device can significantly impact performance. A classic example is running a CNN's conv layers on a GPU (fast for dense arithmetic) but perhaps running a small decision tree or post-processing step on the CPU. ONNX Runtime's graph partitioner can split a model so that supported ops run on, say, an FPGA and the rest run on the CPU. The placement problem can be complex: one must consider device speeds, memory capacity, and the cost of moving data between devices. Research systems like Moirai use cost models or even ILP to find an optimal partition that minimizes overall latency given heterogeneous device profiles. For instance, if an accelerator has very limited memory, you might assign only the most compute-intensive layers to it and leave earlier layers on the CPU to reduce data transfer. Heterogeneous placement is increasingly relevant with specialized chips (like a vision DSP + CPU in mobile phones). The goal is to fully utilize each component: e.g., Apple's Core ML will offload what it can to the Neural Engine, then the GPU, and use the CPU as a last resort, in a way that maximizes total throughput. Model parallelism across heterogeneous devices also introduces scheduling needs – e.g., if the CPU part and GPU part run concurrently on different requests (to overlap), those tasks need to be scheduled carefully. Some systems treat heterogeneous inference as a pipeline (with each device as a stage).
Request-Level vs. Operator-Level Pipelining: We can pipeline work at different granularities. Request-level pipelining refers to the scenario where an inference service has multiple processing stages (e.g., feature extraction, inference, result aggregation) and different requests can be in different stages simultaneously. For example, in an online recommendation service, while Request A's model inference is running on the GPU, Request B might already be in the feature lookup stage on the CPU – so the requests are processed in an overlapped pipeline. This improves throughput and reduces overall response time by utilizing all components in parallel. It's essentially the classic web server pipeline: parse request -> inference -> format response, each step handled by different threads or workers in parallel for different requests. Operator-level pipelining, on the other hand, pipelines the execution within a single request's model. This is similar to the model pipeline parallelism discussed earlier: different layers (operators) of the neural network are assigned to different processors, and successive layers operate like an assembly line on the data. For instance, one can split a 12-layer transformer between two GPUs (6 layers each). While GPU2 is processing layers 7–12 of sample #1, GPU1 can start processing layers 1–6 of sample #2 – thus two samples are in the "operator pipeline". The distinction is basically whether the pipeline is across different requests or within one request's model inference. In practice, operator-level pipelining is used to reduce latency for a single inference when one device alone is too slow or doesn't have enough memory – you achieve parallelism at a finer granularity. Request-level pipelining is more about overall throughput, ensuring high utilization when there are concurrent requests. Both can be combined: e.g., a pipeline-parallel model serving multiple requests at once (each request enters the pipeline staggered). The important consideration is that operator-level pipelining requires splitting the model (which can increase inter-device communication), whereas request-level pipelining requires a multi-stage workflow (which many inference applications naturally have).
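A request-level pipeline can be sketched with ordinary queues and worker threads, with each request flowing through the stages while other requests occupy the other stages (the stage functions here are placeholders standing in for real preprocessing, GPU inference, and postprocessing):

```python
# Toy request-level pipeline: three stages connected by queues, each in its own thread.
import queue
import threading

q_pre, q_infer, q_post, results = (queue.Queue() for _ in range(4))

def stage(in_q, out_q, fn):
    while True:
        out_q.put(fn(in_q.get()))

preprocess  = lambda r: {**r, "features": f"feat({r['raw']})"}
infer       = lambda r: {**r, "logits": f"model({r['features']})"}      # stand-in for GPU inference
postprocess = lambda r: {**r, "response": f"format({r['logits']})"}

threading.Thread(target=stage, args=(q_pre, q_infer, preprocess), daemon=True).start()
threading.Thread(target=stage, args=(q_infer, q_post, infer), daemon=True).start()
threading.Thread(target=stage, args=(q_post, results, postprocess), daemon=True).start()

for i in range(4):                         # several in-flight requests overlap across the stages
    q_pre.put({"id": i, "raw": f"request-{i}"})
for _ in range(4):
    print(results.get()["response"])
```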
In summary, inference optimization involves packing work (batching) and ordering work (scheduling) to best use hardware, overlapping wherever possible. Greedy and heuristic schedulers work well in many cases, but as systems scale, more sophisticated scheduling (even learned or optimized via ILP solvers offline) can squeeze out extra performance. The end goal remains the same: maximize throughput and minimize latency jitter, ensuring each hardware resource (CPU core, GPU, NIC) is doing useful work as much of the time as possible.
4. Server-Based vs. Serverless ML Deployments
ML models can be deployed on dedicated servers/containers or via serverless function services, each with trade-offs:
Dedicated Server (Monolithic or Containerized) Deployments: In the traditional approach, one or more server instances are allocated for the ML service and run 24/7. For example, a Kubernetes cluster might be running a set of pods each serving a model. These servers are always on and typically have the model loaded in memory. The capacity (throughput) is provisioned either statically or via auto-scaling – e.g., scaling up the number of replicas during high load periods. This model avoids startup latency (since the process is long-lived and warm) – thus it offers low latency for requests once the server is up. It also allows maximum control over resources and custom libraries (you can use GPUs, specialized hardware, etc., as needed). However, during idle times, the server is still running (wasting resources). Auto-scaling can partly mitigate this by shutting down instances during low traffic, but scaling in/out isn't instantaneous either (starting a new VM or container might take tens of seconds). The cost model here is typically paying for reserved resources (VMs or containers) regardless of usage, which can be inefficient for spiky or infrequent workloads. On the other hand, for steady high-volume services, dedicated instances are often more cost-effective and predictable in performance than serverless.
Serverless (FaaS) Deployments: Serverless platforms like AWS Lambda, Azure Functions, or Google Cloud Functions allow you to deploy model inference as a function that scales automatically and runs only when invoked. The major benefit is no need to manage servers – the cloud provider handles provisioning containers and scaling them based on incoming events. And you pay per invocation (per 100ms of execution, for example) rather than for idle time. This is ideal for workloads with intermittent traffic: "on-demand Serverless Inference is ideal for workloads with idle periods between traffic spurts". For instance, if a model is used only a few times an hour, serverless can be much cheaper than running a dedicated VM all hour. Serverless can also rapidly scale out to many parallel executions if traffic spikes, without the operator needing to pre-provision capacity – the platform will spawn as many function instances as needed (within limits). However, cold start latency is a primary drawback. When a new serverless instance is invoked after being idle, it must initialize – the container is started and the model must be loaded from storage into memory. Cold starts for ML models can be significant (hundreds of milliseconds to seconds, especially for large models). As one study notes, traditional serverless has many optimizations for quick startup with small code, but "large AI models introduce new complexities", such as loading potentially gigabyte-sized weights and initializing accelerator contexts. For example, loading a 1GB model from disk and CUDA initializing a GPU might take several seconds – a big added latency for the first request. Cloud providers have introduced mitigations: AWS Lambda has a provisioned concurrency option (keep N instances warm), and services like AWS SageMaker Serverless Inference keep idle instances around for a short time after use so subsequent requests hit a warm container. In fact, AWS's serverless GPU inference keeps some GPUs "warm" and loaded with model weights to reduce cold start frequency. Despite this, unpredictable traffic that scales from zero will inevitably incur cold start delays. Serverless is thus best when occasional higher latency is acceptable or workloads naturally have enough steady traffic to keep instances warm.
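The standard mitigation pattern on FaaS platforms is to load the model into module-level state so that warm invocations reuse it; the sketch below is a generic, hypothetical handler illustrating that pattern (the two-second sleep stands in for downloading and deserializing a large checkpoint, and the handler name/event shape are placeholders rather than any provider's exact API):

```python
# Hedged sketch of the "load once, reuse while warm" pattern for serverless inference.
import time

_MODEL = None          # module scope: persists across invocations on the same warm instance

def _load_model():
    # Cold-start cost lives here: fetch weights from object storage, deserialize,
    # initialize the runtime / accelerator context.
    time.sleep(2.0)    # stand-in for loading a large checkpoint
    return lambda x: {"prediction": sum(x)}

def handler(event, context=None):
    global _MODEL
    cold = _MODEL is None
    if cold:
        _MODEL = _load_model()     # paid only on a cold start
    result = _MODEL(event["inputs"])
    return {**result, "cold_start": cold}

# First call on a fresh instance pays the load; subsequent warm calls do not.
print(handler({"inputs": [1, 2, 3]}))   # cold_start: True
print(handler({"inputs": [4, 5, 6]}))   # cold_start: False
```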
Resource Allocation and Scaling: In dedicated deployments, one often uses auto-scaling (horizontal scaling by adding/removing instances, or vertical by adjusting resources) to handle variability. This can be based on CPU/GPU utilization or queue latency. Auto-scaling has a reaction time (a new VM might take minutes, a new container maybe tens of seconds) and typically you set min/max capacities. In serverless, scaling is event-driven and rapid: each incoming request can potentially spawn a new function instance immediately (within the platform's limits). The scaling is transparent to the user – you just see more functions running under the hood when load increases, and they disappear when load subsides. This means serverless can handle sudden bursts without manual intervention, whereas a server-based deployment might need headroom or fast scale rules configured. Serverless functions are also ephemeral: after finishing a request (or a short series of batched requests), the instance might be torn down. This leads to efficient resource usage (no idle time costs), but also means state (like a loaded model) might be lost unless the platform keeps the container around for reuse. The cloud cost structure differs: server-based means paying for continuous uptime of instances (possibly with reserved instance discounts), serverless means paying per execution time and memory. For unpredictable or low average utilization, serverless is cost-efficient; for consistently high utilization, dedicated servers can be cheaper because the per-invocation overhead and limits of serverless might add cost. Many organizations adopt a hybrid strategy: e.g., maintain a baseline of dedicated servers for steady load and burst to serverless for spikes, or use serverless for lightweight pre/post-processing while keeping the heavy model on a persistent server.
Cold Start Latency: As mentioned, cold starts are a key concern in serverless ML. A major portion of that latency is loading model weights from remote storage into memory and initializing the framework. For small models, this might be tens of milliseconds; for huge models, it can be minutes if not addressed. The ServerlessLLM project has shown improvements by accelerating model loading to approach hardware bandwidth limits (e.g., using multi-stream reads, or memory-mapped weights). They also suggest smarter checkpoint placement and caching – e.g., storing models in a way that parts of it can be memory-mapped on demand, or keeping a cache of model in memory across warm invocations. Cloud providers sometimes provide provisioned concurrency (as AWS does) where a number of function instances are kept initialized (you pay for that warm pool). This gives the best of both worlds: functions remain "warm" with predictable performance, but you still have auto-scaling beyond that with cold start for extras. Cold starts also involve accelerator context init: e.g. starting a CUDA context on a GPU or loading libraries, which can add a few hundred milliseconds even if the model is in memory. If using CPUs only, cold start is usually faster (tens to a couple hundred ms). Some serverless platforms now offer GPU support (e.g., Lambda GPU). Those will naturally face these same issues of loading big models onto GPUs.
State Management in Serverless ML: Traditional FaaS assumes each invocation is stateless (no carry-over state except what you externalize to a database or cache). But ML inference often benefits from state being kept in memory: the model weights themselves are a form of state – it's expensive to reload them each time. Also, things like a transformer's key-value cache (for streaming inference) or other context need to persist across invocations for efficiency. In a multi-request conversation with an LLM, if each message is a separate function call, the KV cache from previous tokens ideally should be reused; but a pure stateless model would recompute everything if it doesn't retain state. One solution is to