
Machine Learning Systems: Architecture and Optimization (DR)

// Deep Research

1. Inference Engine Foundations

Modern ML inference engines range from general-purpose runtimes to specialized compilers. ONNX Runtime is a flexible engine that loads models in the framework-neutral ONNX format and can partition the computation graph across multiple execution providers (EPs) such as CPU, GPU, and TensorRT. It performs graph optimizations (constant folding, operator fusion) and then delegates subgraphs to the fastest available EP, with the default CPU execution provider as a fallback for any ops not supported on accelerators. TensorRT, by contrast, acts as an ahead-of-time (AOT) ML compiler for NVIDIA GPUs – it parses a model (often via ONNX), applies aggressive optimizations (layer fusion, precision reduction to FP16/INT8, etc.), and generates a highly optimized engine that runs on GPU hardware. TensorRT uses tactics like automatic kernel selection and autotuning to choose the fastest CUDA kernels for each layer, and it can considerably reduce inference latency by compiling models down to a streamlined set of GPU operations.

Core ML (Apple's on-device engine) is built into iOS/macOS and similarly optimizes models for Apple's Neural Engine and GPU. At runtime, Core ML dynamically partitions the neural network graph across the CPU, GPU, and the 16-core Neural Engine to maximize throughput, using FP16 math on the ANE/GPU and FP32 on the CPU. This heterogeneous execution is analogous to ONNX Runtime's EP mechanism, ensuring each layer runs on the best-suited hardware, with fallback to the CPU for full coverage.

PyTorch JIT (TorchScript) embeds a just-in-time compiler within PyTorch. It can trace or script models into an IR, perform optimizations (e.g. fusing elementwise ops), and either execute that IR in a custom interpreter or hand off subgraphs to native libraries. For example, PyTorch with TorchScript can hand off parts of the model to TensorRT using Torch-TensorRT; in this flow, subgraphs that TensorRT supports are compiled into a TRT engine, while the remainder runs on PyTorch's default runtime. This highlights how inference engines often interface with compilers: the runtime identifies segments of the computation that a compiler backend can handle and offloads them (e.g. via a Compile() API in ONNX Runtime's EP plugin model).

TensorFlow Serving, on the other hand, is a high-level serving system for TensorFlow models. It manages loading SavedModel graphs and uses TensorFlow's runtime (with optional XLA JIT compilation) to execute them. TF Serving is designed for production environments – it supports versioned model deployment, batched request handling, and can leverage TensorFlow's optimized kernels for CPUs and GPUs out of the box. In summary, ONNX Runtime and TF Serving act as general runtimes that work with multiple models and hardware targets, whereas TensorRT and Core ML are more specialized code-generation engines for maximal performance on specific platforms, and PyTorch JIT lies somewhere in between, integrating compilation into a deep learning framework.
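To make the execution-provider mechanism concrete, here is a minimal sketch of creating an ONNX Runtime session with an EP priority list. The model path, input name, and input shape are placeholders rather than anything from a specific deployment; any subgraph the TensorRT or CUDA providers cannot handle falls back to the CPU provider.

```python
# Sketch: ONNX Runtime session with an execution-provider priority list.
import numpy as np
import onnxruntime as ort

providers = [
    "TensorrtExecutionProvider",  # tried first, if the build includes it
    "CUDAExecutionProvider",      # next preference
    "CPUExecutionProvider",       # guaranteed fallback for all ONNX ops
]
sess = ort.InferenceSession("model.onnx", providers=providers)  # placeholder path

# The graph has already been partitioned across the EPs above at session creation.
input_name = sess.get_inputs()[0].name
dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)  # placeholder shape
outputs = sess.run(None, {input_name: dummy})
```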

Inference Engines vs. ML Compilers: The line between runtime and compiler is increasingly blurred. Inference engines often include compilation steps (e.g. graph optimization or codegen) while compilers need runtime support (for memory management, etc.). Many engines use an IR (intermediate representation) as a contract between a compiler and the runtime. For instance, XLA (Accelerated Linear Algebra compiler) can compile a TensorFlow graph into optimized device-specific code (AOT), which the TF runtime then loads as an executable. ONNX Runtime uses its IR (the ONNX graph) and allows plugin compilers (EPs) to consume parts of it. The choice of execution model can be ahead-of-time vs. just-in-time: TensorRT generally performs one-time AOT compilation of the model to a GPU engine before inference, yielding very low per-inference overhead. PyTorch's JIT can do lazy compilation at runtime – it might incur overhead on the first few inferences but can specialize the code based on actual data shapes or allow mixing compiled and non-compiled ops. Ahead-of-time compilation tends to yield faster steady-state throughput (since the entire model is fused into optimized code), whereas JIT compilation offers flexibility (e.g. dynamic control flow or shape polymorphism) at the cost of some runtime overhead. In practice, systems often combine both: e.g. TensorFlow will use Grappler (graph optimizer) and maybe XLA to pre-compile static parts of the graph, while leaving truly dynamic portions to be executed by the runtime interpreter. The compiler-runtime interface typically involves the compiler producing either native machine code or an optimized graph of "kernels" that the runtime can execute. ONNX Runtime's EP mechanism essentially treats a compiled subgraph as a custom fused operator with its own kernel – after partitioning, the runtime just invokes that compiled kernel for the subgraph. This modular design allows swapping different compilers under the hood without changing the high-level API.
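As an illustration of the JIT side of this trade-off, the sketch below scripts a small PyTorch module with TorchScript; the module itself is invented for the example. The first calls may pay profiling and specialization overhead, and the compiled artifact can also be saved and reloaded in a more AOT-style serving flow.

```python
import torch
import torch.nn as nn

class SmallNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(128, 64)
        self.fc2 = nn.Linear(64, 10)

    def forward(self, x):
        # Elementwise ops like relu are candidates for fusion by the JIT.
        return self.fc2(torch.relu(self.fc1(x)))

model = SmallNet().eval()

# Scripting captures the module (including control flow) into TorchScript IR;
# tracing would instead record one concrete execution.
scripted = torch.jit.script(model)

# The first calls may be slower while the JIT profiles and specializes the
# graph; subsequent calls run the optimized code.
x = torch.randn(8, 128)
with torch.no_grad():
    out = scripted(x)

# The compiled module can be saved and later loaded without Python code,
# which is closer to an ahead-of-time deployment artifact.
scripted.save("small_net.pt")
```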

Operator Libraries, Kernel Selection, and Fallbacks: Inference engines rely heavily on optimized operator libraries for each hardware type. NVIDIA GPUs use cuDNN, cuBLAS, TensorRT's kernel library, etc., while CPUs may use oneDNN (MKL-DNN) or custom vectorized implementations. The engine must choose the right kernel for each operation considering device, data type (FP32/FP16/INT8), and tensor shapes. Many frameworks use autotuning: for example, TensorRT will benchmark or heuristically select the fastest convolution algorithm for the given shape and target hardware, and TensorFlow does this with cuDNN (trying several convolution algorithms on first run). Kernel selection can thus be static (via predefined heuristics or operator implementations) or dynamic (via profiling). Some engines maintain multiple implementations: e.g. an op could have a GPU kernel, an XLA compiled variant, and a fallback CPU kernel. If the preferred kernel is unavailable (or if the model uses an unsupported custom op), the engine falls back. ONNX Runtime explicitly guarantees a CPU implementation for every ONNX operator as a fallback. That means if, say, the TensorRT EP cannot handle a particular layer, ONNX Runtime will execute that layer on CPU so that the model still runs correctly (albeit slower). PyTorch JIT similarly falls back to the Python interpreter for any ops that are not TorchScript-compatible – the model continues working with a mix of compiled and uncompiled ops. This layered fallback approach ensures functional completeness: the inference can always proceed, with optimized kernels accelerating the parts they can. For extensibility, engines allow registering custom kernels (e.g. user-defined ops in ONNX Runtime or TensorFlow) so that new hardware or new operations can be integrated. In summary, inference systems are a co-design of compilation (graph-level transforms, code generation) and runtime (efficient execution of kernels, memory management, threading). A well-designed engine will maximize usage of high-performance kernels and compilers, but gracefully degrade to slower paths when needed, so that developers get both performance and portability.
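A small example of dynamic kernel selection in practice: PyTorch's cuDNN autotuning flag benchmarks the available convolution algorithms on the first call for each new input shape and caches the fastest one, while the same op falls back to a CPU kernel when no GPU is present. The layer and tensor sizes here are arbitrary.

```python
import torch

# Ask cuDNN to benchmark its convolution algorithms per input shape and cache
# the winner (profiling-based kernel selection). Best for stable shapes.
torch.backends.cudnn.benchmark = True

conv = torch.nn.Conv2d(64, 128, kernel_size=3, padding=1)
if torch.cuda.is_available():
    conv = conv.cuda()
    x = torch.randn(16, 64, 56, 56, device="cuda")
else:
    # Fallback path: the same operator runs on the CPU kernel instead.
    x = torch.randn(16, 64, 56, 56)

with torch.no_grad():
    y = conv(x)  # the first GPU call may be slower while algorithms are benchmarked
```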

2. Memory and Compute Optimization Techniques

To deploy and train large models efficiently, a variety of techniques target reduced memory usage and computational load:

In practice, hybrid parallelism (combining the above) is used for ultra-large models. For example, a large transformer may use 8-way tensor parallelism for each layer and 4 pipeline stages across, e.g., 32 GPUs, plus data parallelism across several such groups. This 3D parallelism was key to training trillion-parameter models, such as those in NVIDIA's Selene supercomputer experiment: NVIDIA researchers reported a 114× throughput improvement when scaling from a 1-billion-parameter model on 32 GPUs to a 1-trillion-parameter model on 3072 GPUs by using 8-way tensor + 8-way pipeline parallelism (and data-parallel groups). The massive scale (the latest MLPerf Training round included a submission with over 10,000 accelerators in one job) shows that with the right parallel strategy, near-linear scaling is achievable. For inference serving, model parallelism (including pipelining) is mainly used to serve models that don't fit on one device or to reduce per-request latency by utilizing multiple devices per request; data parallelism is more commonly used in inference for throughput (cloning the model to serve many queries concurrently). Overall, memory and compute optimizations – from compression to distributed parallelism – are what enable today's extremely large models to be trained and served in practice, by squeezing as much effective performance out of the hardware as possible.
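The sketch below is a framework-agnostic illustration (not any library's actual API) of how a flat GPU rank can be decomposed into tensor-, pipeline-, and data-parallel coordinates for such hybrid parallelism, using the 8-way tensor / 4-stage pipeline example above with two data-parallel replicas.

```python
# Sketch: map a flat GPU rank into (tensor, pipeline, data) parallel coordinates.
# With tp=8, pp=4 and 64 GPUs this gives 2 data-parallel replicas of an 8x4 grid.
def rank_to_3d(rank: int, tp: int, pp: int, world_size: int):
    assert world_size % (tp * pp) == 0
    dp = world_size // (tp * pp)      # number of data-parallel replicas
    tp_rank = rank % tp               # position within a tensor-parallel group
    pp_rank = (rank // tp) % pp       # pipeline stage index
    dp_rank = rank // (tp * pp)       # which data-parallel replica
    return tp_rank, pp_rank, dp_rank, dp

for r in [0, 7, 8, 31, 32, 63]:
    print(r, rank_to_3d(r, tp=8, pp=4, world_size=64))
```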

3. Inference Pipelining and Scheduling

Inference serving must handle many requests efficiently under latency constraints. Batching is a fundamental technique: processing multiple inputs together often yields higher throughput due to better hardware utilization (especially on GPUs, which are throughput-oriented). There are different batching strategies:

Execution Scheduling: Beyond batching, servers must schedule how multiple models and requests run on limited resources. A simple strategy is greedy scheduling – execute whichever requests are available whenever the device is free. More complex scenarios (multiple models, multi-GPU pipelines, etc.) can be formulated as scheduling problems, sometimes as a DAG (directed acyclic graph) of tasks. For example, inference for a request might involve a sequence of CPU preprocessing, GPU inference, then CPU postprocessing – scheduling these with limited thread and GPU resources can be non-trivial. Algorithms like HEFT (Heterogeneous Earliest Finish Time) can be used to schedule DAGs on heterogeneous CPUs/GPUs by considering expected execution times on each. Academic proposals even use ILP (Integer Linear Programming) to optimally assign and order operations to minimize latency or maximize throughput. For instance, the Pesto system formulated the DNN partition+schedule problem as an ILP and achieved optimal placement/scheduling that outperformed heuristics, reducing training time by ~31% vs prior state-of-the-art. In inference serving, an ILP solver might decide how to allocate requests to a pool of devices to meet latency constraints – though ILP is often too slow for real-time decisions, so it's mainly a design-time tool. Practical schedulers might use simpler rules or learned policies. The key considerations are avoiding idle hardware (keep GPUs busy) and avoiding queuing delays for high-priority requests. Many serving systems implement priority queues or service-level objectives – e.g., a high-priority request can preempt batch building. Some research has looked at optimal scheduling of batch inference jobs to meet latency targets, which can be NP-hard, requiring approximations.
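As a toy illustration of greedy, priority-aware scheduling (the request fields and priority values are invented for the example), the sketch below pops the most urgent pending request whenever a device is free; a production scheduler would add batching, SLO tracking, and preemption on top.

```python
# Sketch: a greedy, priority-aware request queue for a single device.
import heapq
import time
from dataclasses import dataclass, field

@dataclass(order=True)
class Request:
    priority: int                              # lower value = more urgent
    arrival: float = field(compare=False)
    payload: object = field(compare=False, default=None)

queue: list[Request] = []

def submit(req: Request) -> None:
    heapq.heappush(queue, req)

def run_when_device_free(execute) -> None:
    # Greedy policy: never let the device idle while work is pending,
    # always taking the highest-priority request first.
    while queue:
        req = heapq.heappop(queue)
        execute(req)

submit(Request(priority=1, arrival=time.time(), payload="interactive query"))
submit(Request(priority=5, arrival=time.time(), payload="batch job"))
run_when_device_free(lambda r: print("running", r.payload))
```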

Overlapping Computation and Communication: High-performance distributed training taught us to overlap communication with computation; the same applies to distributed inference, or even single-device scenarios with I/O. In multi-GPU inference pipelines, one can overlap data transfers (e.g. copying input to the GPU) with GPU computation of the previous inference. Using separate threads/streams, a server might be feeding the next batch to a GPU (or sending data over the network to another node) while the current batch is still processing. This pipelining reduces bubbles. For example, if model partitioning is used (half the layers on GPU0, half on GPU1), then GPU0 can start processing request N+1 while GPU1 is finishing request N – so long as GPU1 receives the intermediate outputs in time. Overlapping is crucial for good device utilization. In a multi-node pipeline, this becomes a producer-consumer scenario; algorithms like PipeDream's "wait-free backprop" essentially overlap forward passes, backward passes, and parameter updates across pipeline stages. In inference the picture is simpler: use asynchronous calls so that as soon as a GPU finishes, it is immediately given new work (possibly already prepared while it was busy). Overlap also applies to networking – e.g. streaming results out as soon as they are ready rather than waiting for the entire batch to finish.
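A minimal PyTorch sketch of this overlap, assuming a machine with a CUDA GPU: a separate copy stream stages the next batch onto the device while the default stream runs inference on the current one. The model and batch sizes are arbitrary.

```python
import torch

assert torch.cuda.is_available()
model = torch.nn.Linear(1024, 1024).cuda().eval()
copy_stream = torch.cuda.Stream()

# Pinned host memory makes non_blocking host-to-device copies truly asynchronous.
batches = [torch.randn(64, 1024).pin_memory() for _ in range(4)]

with torch.no_grad():
    with torch.cuda.stream(copy_stream):
        gpu_batch = batches[0].to("cuda", non_blocking=True)    # stage batch 0

    for i in range(len(batches)):
        torch.cuda.current_stream().wait_stream(copy_stream)    # input i is ready
        current = gpu_batch
        current.record_stream(torch.cuda.current_stream())      # mark cross-stream use
        if i + 1 < len(batches):
            with torch.cuda.stream(copy_stream):                 # prefetch batch i+1
                gpu_batch = batches[i + 1].to("cuda", non_blocking=True)
        out = model(current)                                     # overlaps with the copy

    torch.cuda.synchronize()
```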

Model Partitioning and Placement on Heterogeneous Hardware: In some deployments, different hardware accelerators coexist (CPU, GPU, FPGA, NPU). Deciding which parts of a model run on which device can significantly impact performance. A classic example is running a CNN's convolutional layers on a GPU (fast for dense arithmetic) while running a small decision tree or post-processing step on the CPU. ONNX Runtime's graph partitioner can split a model so that supported ops run on, say, an FPGA and the rest run on the CPU. The placement problem can be complex: one must consider device speeds, memory capacity, and the cost of moving data between devices. Research systems such as Moirai use cost models or even ILP to find an optimal partition that minimizes overall latency given heterogeneous device profiles. For instance, if an accelerator has very limited memory, you might assign only the most compute-intensive layers to it and leave earlier layers on the CPU to reduce data transfer. Heterogeneous placement is increasingly relevant with specialized chips (like a vision DSP plus a CPU in mobile phones). The goal is to fully utilize each component: e.g., Apple's Core ML will offload what it can to the Neural Engine, then the GPU, and use the CPU as a last resort, in a way that maximizes total throughput. Model parallelism across heterogeneous devices also introduces scheduling needs – e.g., if the CPU part and GPU part run concurrently on different requests (to overlap), those tasks need to be scheduled carefully. Some systems treat heterogeneous inference as a pipeline, with each device acting as a stage.
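The following sketch uses a toy cost model (the numbers are invented) to show the flavor of such placement decisions: a dynamic program picks a device for each layer of a linear model chain so that estimated compute time plus inter-device transfer time is minimized.

```python
# Sketch: per-layer device placement over a linear chain with a toy cost model.
import math

devices = ["cpu", "gpu"]
# compute_cost[layer][device]: estimated ms to run that layer on that device
compute_cost = [
    {"cpu": 2.0, "gpu": 0.5},
    {"cpu": 8.0, "gpu": 0.6},
    {"cpu": 1.0, "gpu": 0.4},
]
transfer_cost = 1.5  # estimated ms to move an activation between cpu and gpu

def best_placement(compute_cost, transfer_cost):
    # dp[d] = (total_cost, placement) for plans that put the latest layer on device d
    dp = {d: (compute_cost[0][d], [d]) for d in devices}
    for layer in range(1, len(compute_cost)):
        new_dp = {}
        for d in devices:
            best = (math.inf, None)
            for prev, (cost, plan) in dp.items():
                move = 0.0 if prev == d else transfer_cost
                total = cost + move + compute_cost[layer][d]
                if total < best[0]:
                    best = (total, plan + [d])
            new_dp[d] = best
        dp = new_dp
    return min(dp.values())

cost, plan = best_placement(compute_cost, transfer_cost)
print(plan, cost)  # per-layer device assignment and total estimated latency
```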

Request-Level vs. Operator-Level Pipelining: We can pipeline work at different granularities. Request-level pipelining refers to the scenario where an inference service has multiple processing stages (e.g., feature extraction, inference, result aggregation) and different requests can be in different stages simultaneously. For example, in an online recommendation service, while Request A's model inference is running on the GPU, Request B might already be in the feature lookup stage on the CPU – so the requests are processed in an overlapped pipeline. This improves throughput and reduces overall response time by utilizing all components in parallel. It's essentially the classic web server pipeline: parse request -> inference -> format response, each step handled by different threads or workers in parallel for different requests. Operator-level pipelining, on the other hand, pipelines the execution within a single request's model. This is similar to the model pipeline parallelism discussed earlier: different layers (operators) of the neural network are assigned to different processors, and successive layers operate like an assembly line on the data. For instance, one can split a 12-layer transformer between two GPUs (6 layers each). While GPU2 is processing layers 7–12 of sample #1, GPU1 can start processing layers 1–6 of sample #2 – thus two samples are in the "operator pipeline". The distinction is basically whether the pipeline is across different requests or within one request's model inference. In practice, operator-level pipelining is used to reduce latency for a single inference when one device alone is too slow or doesn't have enough memory – you achieve parallelism at a finer granularity. Request-level pipelining is more about overall throughput, ensuring high utilization when there are concurrent requests. Both can be combined: e.g., a pipeline-parallel model serving multiple requests at once (each request enters the pipeline staggered). The important consideration is that operator-level pipelining requires splitting the model (which can increase inter-device communication), whereas request-level pipelining requires a multi-stage workflow (which many inference applications naturally have).
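A minimal sketch of operator-level pipelining in PyTorch, assuming a host with two GPUs: the first six layers live on cuda:0 and the last six on cuda:1, and micro-batches are fed in sequence so that, thanks to asynchronous kernel launches, the two stages can overlap across consecutive micro-batches. The layers are illustrative linear layers rather than a real transformer.

```python
import torch
import torch.nn as nn

assert torch.cuda.device_count() >= 2
stage0 = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(6)]).to("cuda:0").eval()
stage1 = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(6)]).to("cuda:1").eval()

micro_batches = [torch.randn(32, 1024, device="cuda:0") for _ in range(4)]
results = []

with torch.no_grad():
    for mb in micro_batches:
        h = stage0(mb)                          # runs on GPU 0
        h = h.to("cuda:1", non_blocking=True)   # hand the activation to GPU 1
        results.append(stage1(h))               # runs on GPU 1

# Wait for outstanding work on both devices before reading results.
for dev in ("cuda:0", "cuda:1"):
    torch.cuda.synchronize(dev)
```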

In summary, inference optimization involves packing work (batching) and ordering work (scheduling) to best use hardware, overlapping wherever possible. Greedy and heuristic schedulers work well in many cases, but as systems scale, more sophisticated scheduling (even learned or optimized via ILP solvers offline) can squeeze out extra performance. The end goal remains the same: maximize throughput and minimize latency jitter, ensuring each hardware resource (CPU core, GPU, NIC) is doing useful work as much of the time as possible.

4. Server-Based vs. Serverless ML Deployments

ML models can be deployed on dedicated servers/containers or via serverless function services, each with trade-offs:

Dedicated Server (Monolithic or Containerized) Deployments: In the traditional approach, one or more server instances are allocated for the ML service and run 24/7. For example, a Kubernetes cluster might be running a set of pods each serving a model. These servers are always on and typically have the model loaded in memory. The capacity (throughput) is provisioned either statically or via auto-scaling – e.g., scaling up the number of replicas during high load periods. This deployment model avoids startup latency (since the process is long-lived and warm) – thus it offers low latency for requests once the server is up. It also allows maximum control over resources and custom libraries (you can use GPUs, specialized hardware, etc., as needed). However, during idle times, the server is still running (wasting resources). Auto-scaling can partly mitigate this by shutting down instances during low traffic, but scaling in/out isn't instantaneous either (starting a new VM or container might take tens of seconds). The cost model here is typically paying for reserved resources (VMs or containers) regardless of usage, which can be inefficient for spiky or infrequent workloads. On the other hand, for steady high-volume services, dedicated instances are often more cost-effective and predictable in performance than serverless.

Serverless (FaaS) Deployments: Serverless platforms like AWS Lambda, Azure Functions, or Google Cloud Functions allow you to deploy model inference as a function that scales automatically and runs only when invoked. The major benefit is that there is no need to manage servers – the cloud provider handles provisioning containers and scaling them based on incoming events – and you pay per invocation (per 100 ms of execution, for example) rather than for idle time. This is ideal for workloads with intermittent traffic: "on-demand Serverless Inference is ideal for workloads with idle periods between traffic spurts". For instance, if a model is used only a few times an hour, serverless can be much cheaper than running a dedicated VM all hour. Serverless can also rapidly scale out to many parallel executions if traffic spikes, without the operator needing to pre-provision capacity – the platform will spawn as many function instances as needed (within limits). However, cold start latency is the primary drawback. When a new serverless instance is invoked after being idle, it must initialize – the container is started and the model must be loaded from storage into memory. Cold starts for ML models can be significant (hundreds of milliseconds to seconds, especially for large models). As one study notes, traditional serverless platforms have many optimizations for quick startup with small code, but "large AI models introduce new complexities", such as loading potentially gigabyte-sized weights and initializing accelerator contexts. For example, loading a 1 GB model from disk and initializing CUDA on a GPU might take several seconds – a big added latency for the first request. Cloud providers have introduced mitigations: AWS Lambda has a provisioned concurrency option (keep N instances warm), and services like AWS SageMaker Serverless Inference keep idle instances around for a short time after use so subsequent requests hit a warm container. In fact, AWS's serverless GPU inference keeps some GPUs "warm" and loaded with model weights to reduce cold start frequency. Despite this, unpredictable traffic that scales from zero will inevitably incur cold start delays. Serverless is thus best when occasional higher latency is acceptable or when workloads naturally have enough steady traffic to keep instances warm.

Resource Allocation and Scaling: In dedicated deployments, one often uses auto-scaling (horizontal scaling by adding/removing instances, or vertical by adjusting resources) to handle variability. This can be based on CPU/GPU utilization or queue latency. Auto-scaling has a reaction time (a new VM might take minutes, a new container maybe tens of seconds) and typically you set min/max capacities. In serverless, scaling is event-driven and rapid: each incoming request can potentially spawn a new function instance immediately (within the platform's limits). The scaling is transparent to the user – you just see more functions running under the hood when load increases, and they disappear when load subsides. This means serverless can handle sudden bursts without manual intervention, whereas a server-based deployment might need headroom or fast scale rules configured. Serverless functions are also ephemeral: after finishing a request (or a short series of batched requests), the instance might be torn down. This leads to efficient resource usage (no idle time costs), but also means state (like a loaded model) might be lost unless the platform keeps the container around for reuse. The cloud cost structure differs: server-based means paying for continuous uptime of instances (possibly with reserved instance discounts), serverless means paying per execution time and memory. For unpredictable or low average utilization, serverless is cost-efficient; for consistently high utilization, dedicated servers can be cheaper because the per-invocation overhead and limits of serverless might add cost. Many organizations adopt a hybrid strategy: e.g., maintain a baseline of dedicated servers for steady load and burst to serverless for spikes, or use serverless for lightweight pre/post-processing while keeping the heavy model on a persistent server.

Cold Start Latency: As mentioned, cold starts are a key concern in serverless ML. A major portion of that latency is loading model weights from remote storage into memory and initializing the framework. For small models, this might be tens of milliseconds; for huge models, it can be minutes if not addressed. The ServerlessLLM project has shown improvements by accelerating model loading to approach hardware bandwidth limits (e.g., using multi-stream reads or memory-mapped weights). They also suggest smarter checkpoint placement and caching – e.g., storing models in a format whose parts can be memory-mapped on demand, or keeping a cache of the model in memory across warm invocations. Cloud providers sometimes offer provisioned concurrency (as AWS does), where a number of function instances are kept initialized (you pay for that warm pool). This gives the best of both worlds: functions remain "warm" with predictable performance, while auto-scaling beyond the warm pool still incurs cold starts for the extra instances. Cold starts also involve accelerator context initialization: e.g. starting a CUDA context on a GPU or loading libraries, which can add a few hundred milliseconds even if the model is already in memory. If using CPUs only, cold start is usually faster (tens to a couple hundred milliseconds). Some serverless platforms now offer GPU-backed functions; these naturally face the same issues of loading big models onto GPUs.
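As a rough sketch of the memory-mapping idea (the file layout and helper names here are hypothetical, not ServerlessLLM's or any provider's API), a cold-started function can map the checkpoint and fault in only the pages it actually touches, rather than reading the whole file before serving the first request.

```python
# Sketch: lazily load weights from a memory-mapped checkpoint file.
import mmap
import numpy as np

def open_weights(path: str) -> mmap.mmap:
    with open(path, "rb") as f:
        # Pages are faulted in lazily on first access, not at open time.
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

def load_tensor(mm: mmap.mmap, offset: int, shape: tuple, dtype=np.float16):
    count = int(np.prod(shape))
    # Zero-copy view over the mapped file; bytes are read when first used.
    return np.frombuffer(mm, dtype=dtype, count=count, offset=offset).reshape(shape)

# Usage (the path, offsets, and shapes would come from the checkpoint's index
# in a real system; these values are placeholders):
# mm = open_weights("/models/llm.bin")
# w = load_tensor(mm, offset=0, shape=(4096, 4096))
```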

State Management in Serverless ML: Traditional FaaS assumes each invocation is stateless (no carry-over state except what you externalize to a database or cache). But ML inference often benefits from state being kept in memory: the model weights themselves are a form of state – it's expensive to reload them each time. Also, things like a transformer's key-value cache (for streaming inference) or other context need to persist across invocations for efficiency. In a multi-request conversation with an LLM, if each message is a separate function call, the KV cache from previous tokens ideally should be reused; but a pure stateless model would recompute everything if it doesn't retain state. One solution is to