Distributed Systems Observability and HPC Workload Management with Prometheus and SLURM (DR)

1. Core Concepts

Monitoring Systems vs. Workload Managers

Monitoring System (Prometheus): A monitoring system continuously collects and stores metrics (measurements) from running services and infrastructure, providing visibility into their performance and health (Putting queues in front of Prometheus for reliability – Robust Perception | Prometheus Monitoring Experts). Prometheus is a prime example – it’s a time-series monitoring system that gathers numeric metrics (CPU usage, request rates, etc.) over time, stores them, and enables querying and alerting on those metrics (Putting queues in front of Prometheus for reliability – Robust Perception | Prometheus Monitoring Experts). Monitoring systems do not execute user workloads; instead, they observe and report on the state of those workloads and the underlying resources.

Workload Manager (SLURM): A workload manager (or batch scheduler) is responsible for scheduling and executing jobs (user-submitted tasks) on a cluster of compute nodes (Slurm Workload Manager - Overview). SLURM (Simple Linux Utility for Resource Management) is an HPC workload manager that allocates resources (nodes/CPUs/GPUs) to jobs, queues jobs until resources are available, and dispatches jobs to run on the cluster (Slurm Workload Manager - Overview). In simple terms, if Prometheus tells you what is happening on your systems, SLURM is one of the systems that makes things happen by running users’ computational jobs. SLURM handles resource arbitration, enforcing policies like job priorities and fairness, whereas Prometheus handles observation, helping administrators detect issues (like high CPU, failed jobs, etc.). Both are crucial in a large cluster: SLURM keeps the machines busy with jobs, and Prometheus ensures you have insight into how those jobs and machines are performing.

Push-Based vs. Pull-Based Metric Collection

Pull Model (Prometheus): Prometheus uses a pull model for metrics – the Prometheus server periodically scrapes each target (application or exporter) by making HTTP requests to fetch the current metric values (Pull doesn't scale - or does it? | Prometheus) (Why is Prometheus Pull-Based? - DEV Community). In this model, the monitoring system initiates the connection. This has several advantages: the monitoring server knows if a target is down (scrape fails), can easily run multiple redundant servers, and one can manually inspect a target’s metrics by visiting its endpoint (Why is Prometheus Pull-Based? - DEV Community). For example, Prometheus can scrape a node exporter on each host every 15 seconds; if a host is unreachable, Prometheus flags it as down immediately. Pull mode also avoids the risk of flooding the server with data – targets emit metrics only when scraped. Prometheus’ designers found that pulling metrics is “slightly better” for these reasons (Why is Prometheus Pull-Based? - DEV Community), although the difference from push is minor for most cases. Prometheus can handle tens of thousands of targets scraping in parallel, and the real bottleneck is typically processing the metric data, not initiating HTTP connections (Pull doesn't scale - or does it? | Prometheus).

Push Model: In a push-based system, the monitored applications send (push) their metrics to a central collector (often via UDP or a gateway). Systems like StatsD, Graphite, or CloudWatch use this model. Push can be useful for short-lived batch jobs or events that can’t be scraped in time. Prometheus accommodates these via an intermediary Pushgateway for ephemeral jobs that terminate too quickly to be scraped (The Architecture of Prometheus. This article explains the Architecture… | by Ju | DevOps.dev) (The Architecture of Prometheus. This article explains the Architecture… | by Ju | DevOps.dev). However, a pure push model requires each application to know where to send data and can overwhelm the collector if misconfigured. Pull systems inherently get a built-in heartbeat (no scrape = problem) (Why is Prometheus Pull-Based? - DEV Community), whereas push systems often need separate health checks. In practice, many monitoring setups use a hybrid: Prometheus mostly pulls, but can pull from a Pushgateway where other apps have pushed their metrics.
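
As a concrete illustration, a short-lived batch job can push a metric to a Pushgateway with a plain HTTP request before it exits; a minimal sketch (the Pushgateway address and metric/job names are placeholders):

# Push one metric for the group "backup_job"; Prometheus later scrapes the Pushgateway
echo "backup_last_success_timestamp_seconds $(date +%s)" \
  | curl --data-binary @- http://pushgateway.example.com:9091/metrics/job/backup_job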

Time-Series Data, Labels, and Alerting Semantics

Modern monitoring metrics are stored as time-series: streams of timestamped values for each metric (Data model | Prometheus). Prometheus’s data model is multi-dimensional: each time-series is identified by a metric name and a set of key-value labels (dimensions) (Data model | Prometheus) (Data model | Prometheus). For example, node_cpu_seconds_total{mode="idle", instance="node1"} is a counter of idle CPU seconds for instance node1. Labels allow slicing and dicing metrics (e.g., aggregate CPU by mode or host) and are fundamental to Prometheus’s query power (Data model | Prometheus). This label-based approach contrasts with older systems that used rigid metric naming conventions – Prometheus’s flexible labels enable powerful ad-hoc queries over dimensions like service, datacenter, instance, etc. However, high-cardinality labels (like per-user IDs) should be avoided as each unique label combination produces a new time-series, exploding data stored (Metric and label naming | Prometheus).
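
To make the label model concrete, here are two PromQL queries over the node_cpu_seconds_total example above, slicing the same data along different dimensions (a sketch; the metric comes from the node exporter):

# Per-instance CPU usage rate, excluding idle time
sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))

# The same data sliced the other way: CPU seconds per second, per mode, across the fleet
sum by (mode) (rate(node_cpu_seconds_total[5m]))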

Prometheus stores recent time-series data locally in its TSDB (Time Series Database) and uses efficient techniques (WAL and compaction) to manage it. New samples append to an in-memory buffer and a write-ahead log (WAL) on disk for durability (Storage | Prometheus). Periodically, samples are compacted into longer-term storage files (blocks), e.g., 2-hour blocks combined into larger blocks up to the retention period (Storage | Prometheus). This design makes writes fast and recoverable (WAL replay on restart) while keeping older data in compressed form. It’s a single-node datastore by design – Prometheus doesn’t cluster its storage (external systems like Thanos/Cortex are used for that), favoring simplicity and reliability of one node (Overview | Prometheus).

Alerting: On top of metric collection, a monitoring system provides alerting semantics. In Prometheus, alerts are defined by alerting rules which continuously evaluate PromQL expressions and fire when conditions are met (Alerting rules | Prometheus). For instance, an alert rule might check if CPU usage >90% for 10 minutes. When the rule condition is true, Prometheus marks an alert “firing” and sends it to the Alertmanager, an external component that de-duplicates alerts, applies silences/inhibitions, and routes notifications (email, PagerDuty, etc.). An alert is essentially a special time-series that becomes active when some metric condition holds (Alerting rules | Prometheus). This approach means alert logic is version-controlled and reproducible (just queries with thresholds), rather than hidden in code. Prometheus’s alerting emphasizes being timely and reliable (it may drop some metric samples under load, but tries hard not to miss firing an alert when an outage happens) (Putting queues in front of Prometheus for reliability – Robust Perception | Prometheus Monitoring Experts). In summary, monitoring systems track metrics over time and raise alerts on abnormal conditions, but they don’t remediate problems themselves. That’s left for humans or automation systems.
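
The "CPU > 90% for 10 minutes" example above might look like the following alerting rule, written in the standard rule-file format (threshold, labels, and group name are illustrative):

groups:
- name: node-alerts
  rules:
  - alert: HighCpuUsage
    # Fires when non-idle CPU exceeds 90% of capacity for 10 minutes
    expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.9
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "CPU usage above 90% on {{ $labels.instance }} for 10 minutes"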

Batch vs. Interactive Jobs, Queues, and Scheduling Policies in HPC

In HPC environments managed by a scheduler like SLURM, users typically run batch jobs. A batch job is a non-interactive workload submitted to a queue, often via a script with resource requirements (e.g., 16 cores for 2 hours). The job waits in a queue until the scheduler allocates the requested resources, then runs to completion. This is in contrast to interactive usage (like running commands on a login node or using an interactive allocation). HPC systems separate these for efficiency and fairness: users submit work to the scheduler instead of directly logging into compute nodes (Scheduling Basics - HPC Wiki) (Scheduling Basics - HPC Wiki). Batch jobs ensure the cluster runs at high utilization with managed scheduling, whereas direct interactive runs could conflict and overload resources.

Interactive jobs in SLURM can be run with commands like srun or salloc which grant a shell or run a command on allocated nodes in real-time. These are useful for debugging or running interactive applications (e.g., an interactive Jupyter notebook on compute nodes). They still go through SLURM’s allocation mechanism (just with immediate execution after allocation). In practice, HPC centers have login nodes for compiling and prepping, but computation happens via batch or interactive jobs on compute nodes that are otherwise not directly accessible (Scheduling Basics - HPC Wiki).
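
For concreteness, a minimal batch script matching the "16 cores for 2 hours" example above, plus an interactive alternative (job name, paths, and time limits are illustrative):

#!/bin/bash
#SBATCH --job-name=sim16
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16     # 16 cores
#SBATCH --time=02:00:00        # 2-hour wall-clock limit
#SBATCH --output=sim16-%j.out  # %j expands to the job ID
srun ./my_simulation --threads 16

# Submitted with:  sbatch job.sh
# Interactive alternative: request an allocation, then run commands on it
#   salloc --cpus-per-task=16 --time=01:00:00
#   srun --pty bash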

Scheduling Policies: SLURM uses a combination of scheduling policies to decide which job to run from the queue when resources free up. By default it is essentially first-in, first-out (FIFO) order, modified by priority factors. Administrators typically configure the multifactor priority plugin, which gives each job a priority score based on factors like waiting time (age), job size, user fair-share, quality-of-service (QoS), etc. (Slurm Workload Manager - Overview). For fairness, SLURM supports fair-share scheduling, where users or projects are assigned shares of the cluster and if someone has used more than their share recently, their new jobs get lower priority (Slurm Workload Manager - Classic Fairshare Algorithm). This prevents one user from monopolizing the cluster – over time the scheduler “banks” usage and biases in favor of those who have run less.

Another important HPC policy is backfill scheduling. Backfilling allows smaller jobs to run out-of-order to avoid wasting idle resources, so long as they don’t delay the start of a higher-priority (often larger) job at the top of the queue (Scheduling Basics - HPC Wiki). The scheduler will reserve nodes for a big job that can’t start yet (maybe waiting for more nodes to free), but in the meantime will backfill shorter jobs into the gaps. This significantly improves utilization: short jobs experience very short queue times since they can slip in, and the large job still starts as soon as its reserved resources become free (Scheduling Basics - HPC Wiki). SLURM’s scheduler can run a backfill cycle periodically to find these opportunities.

SLURM also supports preemption (high-priority jobs can force lower ones off, for example, an urgent job preempts a running job which might be requeued or canceled) and partitions (distinct sets of nodes with their own job queues and limits, analogous to multiple job queues) to enforce different policies. For instance, an “interactive” partition might allow only short jobs for quick turnaround, or a “gpu” partition ensures only GPU-equipped nodes are used for GPU jobs. Priority/Fairness settings along with partitions and QoS give administrators a rich toolset to enforce organizational policies (e.g., research groups get equal share, or certain jobs always have higher priority). The key is that HPC workload managers balance throughput, utilization, and fair access (Scheduling Basics - HPC Wiki): the scheduler tries to minimize wait time, maximize node usage (CPU/GPU not sitting idle), and ensure no single user or project unfairly hogs the cluster.
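
As a rough illustration of how these policies are expressed, a slurm.conf excerpt might combine the multifactor priority plugin, backfill, preemption, and partitions along these lines (a sketch; weights, node ranges, and the preemption mode are illustrative, not recommendations):

# slurm.conf (excerpt)
SchedulerType=sched/backfill          # enable backfill scheduling
PriorityType=priority/multifactor     # multifactor job priority
PriorityWeightAge=1000
PriorityWeightFairshare=10000
PriorityWeightJobSize=1000
PriorityWeightQOS=5000
PreemptType=preempt/qos               # higher QOS can preempt lower QOS
PreemptMode=REQUEUE
PartitionName=interactive Nodes=cn[001-016] MaxTime=04:00:00 Default=NO
PartitionName=gpu         Nodes=gpu[01-08]  MaxTime=2-00:00:00 Default=NO
PartitionName=batch       Nodes=cn[001-512] MaxTime=7-00:00:00 Default=YES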

Control Plane vs. Data Plane in Cluster Systems

In distributed systems, we distinguish the control plane – which makes decisions and orchestrates – from the data plane – which carries out the actual work (Control Plane vs. Data Plane: What’s the Difference? | Kong Inc.). In our context:

SLURM: slurmctld is the control plane, deciding which job runs where and when; the slurmd daemons and the jobs they launch on compute nodes are the data plane doing the actual work.

Prometheus: the Prometheus server, its rule evaluation, and Alertmanager form the observability control plane, while the exporters and instrumented applications on each node are the data plane being observed.

Keeping this separation in mind helps when reasoning about failures: a control-plane outage (slurmctld or Prometheus down) stops new scheduling decisions or alerting, but the data plane (running jobs, node exporters) typically keeps working in the meantime.

2. Architectural Overview

Prometheus Architecture

Prometheus follows a simple but powerful single-server architecture encompassing metric collection, storage, querying, and alerting in one cohesive system (The Architecture of Prometheus. This article explains the Architecture… | by Ju | DevOps.dev). The main components of a Prometheus deployment are:

The Prometheus server: scrapes targets, stores samples in its local TSDB, evaluates recording and alerting rules, and answers PromQL queries over an HTTP API.

Client libraries: used to instrument your own applications so they expose a /metrics endpoint.

Exporters: standalone bridges (node exporter, cAdvisor, the SLURM exporter, etc.) that expose metrics for systems you cannot instrument directly.

Pushgateway: an optional intermediary that holds metrics pushed by short-lived batch jobs until Prometheus scrapes them.

Alertmanager: receives alerts fired by the server, de-duplicates and groups them, and routes notifications.

Visualization: typically Grafana dashboards, plus the built-in expression browser for ad-hoc queries.

A simplified view of Prometheus’s architecture is: the Prometheus server scrapes metrics from instrumented jobs (or exporters), stores those samples locally, then runs rules on that data to produce aggregated series or trigger alerts (Overview | Prometheus). Users can query the data via PromQL (directly via API or through Grafana dashboards). Everything is designed to be self-contained and reliable on a single node (no external dependencies for core operation) (Overview | Prometheus). For scalability, you run multiple Prometheus servers (e.g., sharding by metric kind or environment) or use federation/remote write to hierarchical systems, but each one keeps the same simple internal architecture. Figure 1 illustrates Prometheus’s components and ecosystem at a high level (Prometheus: Monitoring at SoundCloud | SoundCloud Backstage Blog) (with optional components like Pushgateway and Grafana as external integrations).
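
A minimal prometheus.yml wiring these pieces together might look like the following sketch (target addresses and file names are placeholders):

global:
  scrape_interval: 15s
rule_files:
  - "alerts.yml"            # recording/alerting rules evaluated by the server
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager.example.com:9093']
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['node1:9100', 'node2:9100']        # node exporters
  - job_name: pushgateway
    honor_labels: true
    static_configs:
      - targets: ['pushgateway.example.com:9091']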

SLURM Architecture

SLURM is built as a distributed, modular system with a central brain and per-node agents to actually execute jobs (Slurm Workload Manager - Overview). The key components of SLURM’s architecture include:

slurmctld: the central controller daemon that tracks cluster state, holds the job queue, and makes scheduling decisions; an optional backup controller can take over on failure.

slurmd: a lightweight daemon on every compute node that launches job steps, monitors and constrains them, and reports status back to slurmctld.

slurmdbd: an optional accounting daemon, backed by a database, that stores historical job and usage records.

Client commands and APIs: sbatch, srun, salloc, squeue, sinfo, scancel, sacct, scontrol (and the REST API) through which users and administrators interact with the controller.

Figure 2 below (SLURM Components) shows a high-level diagram of this architecture (Slurm Workload Manager - Overview). There is a central slurmctld (and optional backup), numerous slurmd daemons on each compute node, an optional slurmdbd connected to a database for accounting, and various clients (commands or API users) interacting with slurmctld. The design is purposefully decentralized for scalability: the heavy work of running jobs is distributed to the slurmds, while the central daemon focuses on scheduling logic and coordination. This has been proven to scale to clusters of tens of thousands of nodes by minimizing per-job overhead and using efficient RPCs between controller and nodes (Slurm Workload Manager - Overview). The SLURM controller and nodes exchange messages for job launch, completion, and heartbeats. If a node fails, slurmctld notes it and can requeue the node’s jobs or alert the admins. If slurmctld fails, the backup can take over without interrupting running jobs (which slurmd will continue to run and then later report to the new controller).

To summarize, SLURM’s architecture consists of: (1) a central brain scheduling jobs and managing state, (2) distributed agents on each node to execute and report on jobs, (3) an optional accounting database for historical data, and (4) a suite of user-facing commands/APIs to interact with the system (Slurm Workload Manager - Overview) (Slurm Workload Manager - Overview). This modular design (with plugins for different features) allows SLURM to run everything from small clusters to the world’s largest supercomputers by adjusting components and scheduling algorithms as needed.
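
The user-facing commands map directly onto these components; a few standard examples (the job and node IDs are placeholders):

sinfo                      # node/partition state, answered by slurmctld
squeue -u $USER            # queued and running jobs for the current user
scontrol ping              # reports whether the primary/backup slurmctld respond
scontrol show node cn001   # detailed state of one node as seen by slurmctld
sacct -j 12345 --format=JobID,State,Elapsed,MaxRSS   # accounting data via slurmdbd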

3. Observability and Monitoring in Depth

Instrumentation Best Practices (Counters, Gauges, Histograms)

To make systems observable with Prometheus, proper instrumentation is essential. Prometheus client libraries expose four main metric types, each with best-practice usage patterns (Instrumentation | Prometheus):

Counter: a monotonically increasing value (e.g., requests served, jobs completed, errors). Never decrease a counter; query it with rate() to get a per-second rate.

Gauge: a value that can go up and down (e.g., current memory usage, number of running jobs, queue depth).

Histogram: samples observations (e.g., request durations or sizes) into configurable buckets, plus a _sum and _count, enabling quantile estimates with histogram_quantile() and aggregation across instances.

Summary: similar to a histogram but computes quantiles on the client side; those quantiles cannot be aggregated across instances, so histograms are usually preferred for latency SLOs.

Some additional instrumentation best practices from Prometheus maintainers (Instrumentation | Prometheus):

Use base units (seconds, bytes) and descriptive metric names, with a _total suffix for counters (e.g., http_requests_total, node_cpu_seconds_total).

Instrument failure modes (errors, timeouts) alongside the success path, so error ratios can be computed.

Keep labels bounded: label by things like method, status code, or partition, not by user ID, job ID, or other unbounded values.

Prefer instrumenting the service itself (whitebox monitoring) and fall back to exporters only where you cannot modify the code.

By adhering to these practices, you ensure that the data Prometheus collects is both useful and efficient to process. Counters for events, gauges for states, histograms for distributions (especially for request latency or sizes which are critical for SLOs), and mindful labeling will lead to an instrumentation that can scale to millions of series without becoming a headache.
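
In the exposition format that Prometheus scrapes, a well-instrumented counter, gauge, and histogram look roughly like this (metric names and values are illustrative):

# HELP app_requests_total Total HTTP requests handled.
# TYPE app_requests_total counter
app_requests_total{method="get",code="200"} 10234

# HELP app_inflight_requests Requests currently being served.
# TYPE app_inflight_requests gauge
app_inflight_requests 3

# HELP app_request_duration_seconds Request latency.
# TYPE app_request_duration_seconds histogram
app_request_duration_seconds_bucket{le="0.1"} 9000
app_request_duration_seconds_bucket{le="0.5"} 10100
app_request_duration_seconds_bucket{le="+Inf"} 10234
app_request_duration_seconds_sum 1520.7
app_request_duration_seconds_count 10234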

Exporters: Node Exporter, cAdvisor, and Kube-State-Metrics

Prometheus’s rich ecosystem of exporters makes it easy to monitor all parts of your stack, including third-party systems and underlying infrastructure, without modifying them. Some key exporters and integrations relevant to large-scale systems and HPC:

Node exporter: runs on every Linux host and exposes hardware/OS metrics (CPU, memory, disk, network, filesystem) read from /proc and /sys.

cAdvisor: exposes per-container resource usage (CPU, memory, I/O) by tapping into the container runtime; on Kubernetes it is embedded in the kubelet.

kube-state-metrics: exposes the state of Kubernetes objects (deployments, pods, nodes, jobs) by querying the Kubernetes API.

SLURM exporter: queries slurmctld (via commands or the REST API) and exposes cluster, queue, and scheduler metrics, covered in detail in Section 5.

The guiding principle of exporters is that they bridge external systems to Prometheus by translating whatever monitoring interface those systems have (be it CLI commands, APIs, or /proc files) into Prometheus metrics exposition format. This avoids having to modify those systems. For instance, Node exporter reads lots of files from /proc and /sys in Linux to get its metrics, cAdvisor taps into the container runtime, and kube-state-metrics uses the K8s API. As a user of Prometheus, you just point Prometheus at these exporters’ endpoints. This provides a uniform view in Prometheus where everything is a time series with labels, even though under the hood data came from very different sources.
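
Because every exporter speaks the same HTTP exposition format, you can inspect one manually; for example, against a node exporter on its default port 9100 (hostname and sample values are illustrative):

curl -s http://node1:9100/metrics | grep '^node_cpu_seconds_total' | head -3
# node_cpu_seconds_total{cpu="0",mode="idle"} 187359.2
# node_cpu_seconds_total{cpu="0",mode="system"} 1042.6
# ...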

SLIs and SRE Principles (Choosing the Right Metrics)

In Google’s Site Reliability Engineering (SRE) practices, a Service Level Indicator (SLI) is a carefully defined metric that reflects the quality of service – essentially, what your users care about. Typical SLIs include request latency, error rate, throughput, availability, etc. (Google SRE - Defining slo: service level objective meaning). For example, an SLI might be “the fraction of requests that succeed” or “99th percentile response time”. These tie into SLOs (objectives) and SLAs (agreements) which set targets for these indicators (e.g., 99.9% of requests under 200ms). When instrumenting systems, it’s important to capture metrics that can serve as SLIs for your services. Prometheus is often used to gather those metrics.

SRE’s Golden Signals: Google’s SRE book suggests monitoring four key signals for any service: Latency, Traffic, Errors, Saturation. Latency and errors are direct indicators of user experience (how fast? how often failures?). Traffic (e.g., QPS or requests per second) gives context of load. Saturation refers to how utilized the system is (CPU, memory, etc.), indicating how close to limits the service is. Prometheus instrumentation should ideally cover these: e.g., an HTTP service would have a latency histogram (for request duration), a counter for requests (traffic) possibly partitioned by outcome (to compute error rate), and gauges or resource metrics to measure saturation (CPU, memory from node exporter, etc.). Many off-the-shelf instrumentation libraries (such as the Go client’s promhttp middleware, or Spring Boot Actuator/Micrometer in Java) expose such metrics out of the box.

In HPC terms, SLIs could be things like job success rate, scheduling latency (time from job submission to start), or cluster utilization. For instance, if you treat the “cluster” as a service to users, you might have an SLI “job start time within X minutes” or “percentage of nodes available”. However, HPC users often care about throughput (jobs per day) and fairness, which are harder to capture as single SLIs. Still, monitoring things like slurm_job_pending_time_seconds (if such metric is exported per job) could be valuable to see if queue times are rising, indicating potential issues.

Metric design for SLIs: SRE encourages that the metrics you choose should closely measure user-facing outcomes (Google SRE - Defining slo: service level objective meaning) (Google SRE - Defining slo: service level objective meaning). For example, if you run a web portal, an SLI could be the successful page load rate. For an HPC batch system, an SLI for “job scheduling” might be the fraction of jobs started within their expected wait time or the failure rate of jobs due to system errors (not user errors). It’s important to differentiate system reliability from user-caused failures. If a job fails because of a code bug (non-zero exit), that might not count against the service reliability from the SRE perspective (the HPC system was functioning). But if jobs fail due to node crashes or scheduler errors, that is an issue. Thus, one might define an SLI like “% of node-hours available vs total” as an availability measure of the HPC infrastructure.

Using Prometheus, you can implement these SLIs. For example, error rate SLI: define an alert on rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.01 to detect if >1% requests are 5xx (server errors). Or for HPC, monitor the slurm_job_failed_count metrics (if available) to alert if system-wide job failures spike. Google’s SRE also emphasizes aggregation and roll-up: raw metrics need to be aggregated to meaningful levels (e.g., overall success rate vs per-instance). Prometheus’s ability to aggregate by labels helps here – you might compute SLIs via recording rules, e.g., a rule that calculates “cluster_job_success_ratio” by dividing successful jobs by total jobs over some window, excluding user error codes.

Another principle is alert on symptoms, not causes. That means set alerts on the SLI breaches (e.g., “error rate > 5%” or “95th latency > 300ms for 15m”), rather than on intermediate metrics like CPU or memory unless they directly impact service. In an HPC context, a symptom alert might be “job queue wait time above 1 hour for highest priority jobs” as a symptom that something’s wrong (maybe cluster is full or scheduler stuck). The cause could be varied (bad node, etc.), but the symptom catches the user-facing impact.

Service Level Objectives (SLOs) tie into this: if your SLO is “99% of jobs complete successfully”, you’ll watch a metric of successful job completion and ensure it stays above 99%. Prometheus can be used to track SLO compliance over time (e.g., using recording rules to compute rolling success rates). A common pattern is to calculate an “error budget burn rate” – how fast are you consuming the allowable errors – using PromQL, and alert if the burn rate is too high.
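
A sketch of how such SLO tracking might look as Prometheus rules, reusing the http_requests_total error ratio from above (the rule names, window sizes, and thresholds are illustrative; the 14.4 factor follows the common fast-burn heuristic for a 99.9% SLO):

groups:
- name: slo
  rules:
  - record: job:http_error_ratio:rate5m
    expr: |
      sum(rate(http_requests_total{status=~"5.."}[5m]))
        / sum(rate(http_requests_total[5m]))
  - alert: ErrorBudgetBurnTooFast
    # At 14.4x the allowed 0.1% error rate, a monthly budget is gone in roughly two days
    expr: job:http_error_ratio:rate5m > 14.4 * 0.001
    for: 5m
    labels:
      severity: page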

In summary, when designing monitoring for systems (including HPC clusters), identify your critical SLIs – latency, throughput, success rate, utilization, etc. – and ensure those are measurable via metrics. Use counters/histograms for latency and errors (like request durations, job runtime distribution, error counts), use ratios in PromQL to gauge success rates, and use gauges for saturation (like cluster utilization = used_nodes/total_nodes). By following the SRE approach, you focus on metrics that matter to reliability and user happiness, rather than drowning in irrelevant data. And by leveraging Prometheus’s label and aggregation model, you can get both high-level views and detailed breakdowns as needed for troubleshooting.

High-Cardinality Pitfalls and Metric Design Anti-Patterns

As clusters grow and systems become more complex, it’s easy to be tempted to label metrics with very granular dimensions. However, high-cardinality metrics (metrics with a huge number of unique label combinations) can severely impact Prometheus’s performance and memory use (Metric and label naming | Prometheus). Each unique label-set (time-series) consumes memory for tracking and CPU for processing queries. Let’s discuss a few pitfalls and anti-patterns:

Per-entity labels on unbounded sets: labels such as user ID, job ID, request ID, or session ID create a new series for every entity and grow without bound.

Labels derived from input data: putting raw paths, query strings, or error messages into labels has the same effect, and can also leak sensitive data.

One-shot series: metrics that exist only for the lifetime of a short job (e.g., a series per SLURM job) churn the index and bloat the TSDB even though each series holds few samples.

Over-bucketed histograms: a histogram with dozens of buckets, multiplied across many label combinations, multiplies series counts quickly.

Mitigation and Patterns: If you truly need high-cardinality analysis, one approach is to offload it to a logging or tracing system. Prometheus is optimized for aggregated data. If you needed per-job info, maybe you log each job’s runtime to Elasticsearch or a database, and use Kibana or Spark to analyze it. Prometheus could still alert on, say, “job failure count > 0 in 5m”, but not keep a time-series per job. Another approach is to use exemplars (a newer Prometheus feature) where you attach a trace or log ID to a few samples for context, rather than as a persistent label on all samples.

Also, if certain high-cardinality metrics are valuable but heavy, consider shorter retention or relabelling to drop or aggregate labels at scrape time. For example, you might use Prometheus’s drop/keep configurations to ignore metrics that exceed a cardinality threshold or strip certain labels you don’t need.
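
At scrape time this kind of trimming is done with metric_relabel_configs; a sketch that drops an overly detailed label and discards one noisy metric family (the job, target, label, and metric names are illustrative):

scrape_configs:
  - job_name: myapp
    static_configs:
      - targets: ['app1:8080']
    metric_relabel_configs:
      # Remove a high-cardinality label from every scraped series
      - action: labeldrop
        regex: request_id
      # Drop an entire metric family we never query
      - action: drop
        source_labels: [__name__]
        regex: myapp_debug_.*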

In HPC monitoring, one must be especially careful: there are thousands of nodes, hundreds of users, thousands of jobs – metrics involving all three could theoretically generate millions of series (e.g., cpu_usage{node, user, job} is a non-starter!). So the metrics exported by the SLURM exporter and the node exporter deliberately avoid this. They’ll give you node_cpu by node and CPU mode (like idle, system, user), which is a small, fixed set, or slurm_jobs_running{partition} etc. By sticking to aggregated counts (jobs per state, nodes per state, etc.), the SLURM exporter keeps cardinality manageable (GitHub - vpenso/prometheus-slurm-exporter: Prometheus exporter for performance metrics from Slurm.). If you extend monitoring, say you write a script to export GPU process info on each node, be cautious not to include the process ID as a label – better to aggregate like “number_of_gpu_jobs_running_on_node”.

In summary: High-cardinality metrics can stealthily degrade your monitoring system. Always design metrics with questions: Do I need this level of detail? Can I aggregate or bucket this? When in doubt, lean toward fewer, more general metrics and use labels that correspond to stable, bounded sets (service names, roles, categories) rather than unbounded sets (user IDs, timestamps). By doing so, you keep Prometheus healthy and queries fast, which in turn ensures you can reliably alert on and understand your system’s behavior (Metric and label naming | Prometheus).

Advanced PromQL: Rates, Subqueries, and Metric Joins

PromQL’s expressiveness allows complex queries to support deep insights and alerting logic. Here we highlight some advanced uses: calculating rates, using subqueries for rolling windows, and joining metrics with vector matching and label manipulation.
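
For example, a subquery evaluates an inner expression over a rolling window, which is handy for “worst 5-minute rate seen in the last hour” style questions (a sketch using the generic http_requests_total metric):

# Highest 5-minute request rate observed over the past hour, per job
max_over_time(sum by (job) (rate(http_requests_total[5m]))[1h:1m])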

PromQL’s advanced features shine in creating derived metrics and sophisticated alerts. For example, consider an alert: “High CPU load relative to cores available”. You have node_load1 (1-minute load average) and node_cpu_seconds_total (the CPU count can be derived by counting that metric’s per-CPU series, e.g. those with mode="idle", per instance). You might do:

(node_load1{job="node"} / on(instance) count by (instance) (node_cpu_seconds_total{mode="idle"}))
  > 0.8

to detect if load > 80% of the CPU count on a node. Here you’re dividing two vectors whose label sets differ, so they must be matched explicitly on the shared instance label with on(instance). This kind of algebra makes Prometheus more of a “metrics processing language” than just a query filter.

Another advanced alert could be based on absent() function – for example, if a metric from a service hasn’t been seen for 5 minutes, maybe the service is down. absent(up{job="myservice"} == 1) can fire when there are no targets up for that service.

For joining metrics that don’t share labels, sometimes you label one side via recording rule or static config to enable a join. For instance, if you want to attach datacenter info to each node metric, and you have an external file listing node->DC, you could create a gauge metric like node_meta{instance="foo", datacenter="east"} = 1 for each node (perhaps push it via textfile to node exporter). Then in PromQL do node_cpu_seconds_total * on(instance) group_left(datacenter) node_meta to get a version of node_cpu with the datacenter label.

In HPC monitoring, these techniques are useful. For example, join SLURM metrics with node exporter metrics: If SLURM exporter gives slurm_node_state{state="allocated", nodename="foo"} as 1/0, and node exporter has node_energy_joules_total{instance="foo:9100"}, you might align them by some label (though nodename vs instance string differences require label_replace). One could calculate “power usage of allocated nodes vs idle nodes” by joining the two data sets.
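
A sketch of that join, using label_replace to copy the host part of the node exporter’s instance label into a nodename label so the two metric families line up (both metric names here are the hypothetical ones from the example above, not guaranteed exporter names):

# Approximate power draw (W) of nodes currently allocated by SLURM
sum(
    label_replace(rate(node_energy_joules_total[5m]),
                  "nodename", "$1", "instance", "([^:]+):.*")
  * on(nodename) group_left()
    (slurm_node_state{state="allocated"} == 1)
)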

Overall, advanced PromQL allows you to derive insights that raw metrics alone don’t show. It’s often where the monitoring magic happens: defining alerts that combine multiple conditions (e.g., high error rate and high latency -> very bad), or creating summary metrics for dashboards (like cluster efficiency = running cores / total cores, using metrics from Slurm exporter divided by a constant total). Mastering these techniques enables writing precise alerts that reduce noise – for instance, alert on a ratio or trend rather than a single metric threshold. It also helps in capacity planning analysis, anomaly detection, and generating SLO reports from raw data.

4. Resource Scheduling in Depth

SLURM Scheduling Algorithms: FCFS, Fair-Share, Preemption, Backfill, Topology-Aware

SLURM’s scheduling can be tailored, but let’s break down common strategies:

FCFS / priority order: the baseline is first-come, first-served, modified by the multifactor priority plugin (age, fair-share, job size, QoS, partition).

Fair-share: recent usage by a user or account lowers the priority of their new jobs so that shares of the cluster even out over time.

Preemption: higher-priority jobs or QoS levels can suspend, requeue, or cancel lower-priority running jobs.

Backfill: lower-priority short jobs are started out of order in resource gaps, as long as they do not delay the expected start of higher-priority jobs.

Topology-aware placement: the scheduler can prefer allocations that keep a job within one switch or rack, reducing communication latency for tightly coupled MPI jobs.

In operation, SLURM’s scheduler (slurmctld) runs periodically (by default every few seconds) to update priorities and attempt to start jobs. It uses the configured algorithm: sort jobs by priority, try to allocate resources, apply backfill to fill gaps, etc. There is also the concept of a scheduling cycle, and you can get statistics via sdiag (scheduler diagnostics), which shows things like how long each cycle takes and how many jobs were considered/backfilled (GitHub - vpenso/prometheus-slurm-exporter: Prometheus exporter for performance metrics from Slurm.). Tuning may be required for very large systems – e.g., limiting how deep in the queue the backfill algorithm looks, to avoid spending too long computing.

Summary: SLURM’s scheduling takes into account when the job was submitted (Age/Age factor), who submitted it (Fairshare usage), what resources it needs (favoring smaller jobs for backfill or via size factor), what QoS/partition it’s in, and any dependencies or reservations. Preemption and backfill improve utilization and responsiveness for high-priority work. Topology and partitions implement site-specific policies and optimizations. The end goal is to keep the cluster busy (no idle resources unless by policy) while meeting priority rules. Admins have a lot of control: they can define QOS (Quality of Service) levels that bundle priority boosts or preemption ability, and assign jobs to QOS. For example, a “high” QOS could give +1000 priority but limited to a group of users or limited in number of jobs. All these knobs influence the multifactor priority plugin’s outcome (Slurm Workload Manager - Overview).

From a monitoring perspective, understanding these algorithms is important: for example, if jobs are waiting a long time, is it because fairshare is throttling someone? Or is a partition at capacity? Metrics like “pending jobs per partition” or “age of oldest job” can reveal if a scheduler policy is causing a bottleneck. SLURM’s exporter even gives some metrics like scheduling cycle time and backfilled jobs count (GitHub - vpenso/prometheus-slurm-exporter: Prometheus exporter for performance metrics from Slurm.), which can indicate if backfill is actively working or if scheduler is overloaded.

Resource Containers: cgroups and Job Constraints (--mem, --cpus-per-task, --gres)

SLURM ensures that when multiple jobs share a node, they don’t interfere by using Linux cgroups (control groups) to constrain resources per job (Slurm Workload Manager - Control Group in Slurm). Cgroups group a job’s processes and enforce limits on CPU, memory, and more:

CPU: the cpuset/cpu controllers confine a job’s tasks to the cores it was allocated, so extra threads compete only with themselves.

Memory: the memory controller caps resident memory at the job’s request; exceeding it gets the job (or offending step) killed by the OOM mechanism.

Devices: the devices controller restricts access to GPUs and other generic resources so a job only sees the devices it was granted.

Accounting and cleanup: cgroups also let SLURM measure per-job usage accurately and reliably kill every process belonging to a job when it ends.

The net effect is that each job (or in SLURM terms, each “job step” if srun launches sub-tasks) runs in a sandbox of CPU cores, memory, and devices. This is crucial on shared nodes – without it, one job could hog CPU (say run more threads than requested) or consume all memory (causing others to crash). With cgroups, if a job tries to use more CPU than allocated, the scheduler and OS simply won’t allow its threads onto other cores (they’ll be constrained to their cpuset and if fully busy, additional threads just wait). If it tries to use more memory, it gets killed by cgroup limit.

For HPC jobs that require the whole node, cgroups have less impact (the job already has exclusive node). But in many clusters, nodes are shared among smaller jobs, so this isolation is critical.
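
Enforcement is switched on in SLURM’s configuration; a typical sketch looks like the following (exact options and defaults vary by SLURM version):

# slurm.conf (excerpt)
ProctrackType=proctrack/cgroup        # track job processes via cgroups
TaskPlugin=task/cgroup,task/affinity

# cgroup.conf
ConstrainCores=yes        # confine tasks to their allocated CPU cores
ConstrainRAMSpace=yes     # enforce the job's memory request as a hard limit
ConstrainSwapSpace=yes
ConstrainDevices=yes      # restrict access to GPUs/other gres devices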

Constraint Flags: A few important SLURM resource request flags and their meaning (see the example job script below):

--cpus-per-task: number of CPU cores to allocate to each task (for multithreaded programs).

--ntasks / --nodes: number of tasks (e.g., MPI ranks) and the number of nodes to spread them over.

--mem (or --mem-per-cpu): memory per node (or per CPU) to reserve, enforced as a hard limit via the memory cgroup.

--gres: generic resources such as GPUs, e.g., --gres=gpu:2 for two GPUs on each node of the job.

--time: wall-clock limit after which the job is terminated, which also lets the backfill scheduler plan around it.
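
Putting these together, a request for a shared-node GPU job might look like this sketch (values and the script name are illustrative):

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8      # 8 CPU cores for the single task
#SBATCH --mem=32G              # 32 GB of RAM, enforced via the memory cgroup
#SBATCH --gres=gpu:2           # two GPUs (generic resources)
#SBATCH --time=06:00:00
srun ./train.py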

These flags translate to internal resource accounting that the scheduler uses to allocate and the slurmd uses to enforce via cgroups and other OS controls (Slurm Workload Manager - Control Group in Slurm) (Slurm Workload Manager - Control Group in Slurm).

On the monitoring side, one interesting aspect is capturing when cgroup limits are hit. For instance, if jobs frequently get killed for OOM (out-of-memory), one might want to alert. SLURM logs that, and some metrics could be derived (maybe the slurm exporter could count job failures by reason). Also, node exporter can show if processes are being throttled by cgroups (some metrics like container_cpu_cfs_throttled_seconds_total if cAdvisor was used). HPC centers sometimes run Prometheus node exporter with cgroup metrics enabled to see per-cgroup resource usage, but that can be high-cardinality (since cgroup names include job IDs by default). If needed, one could aggregate by state: e.g., show total memory used by all jobs on a node vs total memory.

The bottom line is SLURM leverages Linux kernel features to implement resource containers for jobs, very much akin to what container orchestrators (like Kubernetes) do for pods. In fact, running a job in SLURM is conceptually similar to running a Docker container with CPU/mem limits – both end up creating cgroups. HPC users typically don’t notice cgroups except when they hit a limit (job killed for memory, or using more CPUs does nothing). It provides accounting accuracy too – SLURM can precisely measure a job’s CPU time, memory max, etc., via cgroup controllers (for job accounting records).

Scalability: Job Arrays, Controller Failover, and RPC Architecture

Large HPC systems may have to manage many jobs (millions in backlog) and many nodes. SLURM has several features to handle scale:

Job arrays: a single submission that expands into many related tasks (see the sketch below), which the controller stores and schedules far more cheaply than an equivalent number of independent jobs.

Controller failover: a backup slurmctld can take over from the primary using shared saved state, so running jobs continue uninterrupted during a controller outage.

Efficient RPC fan-out: communication between slurmctld and the slurmd daemons is batched and can be forwarded hierarchically through the nodes, avoiding a thundering herd on the controller.

Scheduler tuning: parameters limit how deep the backfill pass looks and how often full scheduling cycles run, keeping cycle times bounded even with very long queues.
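
A job array in practice, as a minimal sketch (the %50 suffix throttles how many array tasks run concurrently; the script and program names are illustrative):

#!/bin/bash
#SBATCH --array=0-999%50       # 1000 array tasks, at most 50 running at once
#SBATCH --time=00:30:00
srun ./process_chunk --index "${SLURM_ARRAY_TASK_ID}"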

In essence, SLURM’s design goals include being highly scalable (10k+ nodes, 100k+ jobs in queue) (Slurm Workload Manager - Overview). Features like job arrays and hierarchical communication address the scale of jobs and nodes respectively. Most HPC centers push these to their limits (some do millions of jobs per month and track scheduler latency). When monitoring such a system with Prometheus, one could create alerts like “Scheduler cycle time > threshold” from sdiag metrics, to catch if the scheduler is struggling to keep up (which could happen if, say, someone floods 1e6 jobs).

Prometheus itself should be scaled accordingly – scraping a slurm exporter that lists thousands of jobs or nodes is fine (since metrics are aggregated counts, cardinality is not huge). But if one tried to export every job as a metric (bad idea, as discussed), that would hit Prometheus limits.

Accounting and Usage Tracking: slurmdbd and Job States

One of SLURM’s strengths for large clusters is robust accounting: every job’s resource usage and state transitions can be recorded for reporting and enforcing policies:

slurmdbd: the accounting daemon writes job records (user, account, partition, requested and used resources, start/end times, exit code) to a MySQL/MariaDB database.

Associations and limits: the database also holds users, accounts, QoS definitions, and fair-share allocations, which the scheduler consults to enforce usage limits and priorities.

Job states: each job moves through states such as PENDING, RUNNING, COMPLETING, and then a terminal state (COMPLETED, FAILED, CANCELLED, TIMEOUT, NODE_FAIL, PREEMPTED), all of which are recorded.

Reporting: sacct queries individual job records and sreport produces aggregate usage reports per user, account, or cluster (see the examples below).
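
The accounting database is queried through sacct and sreport; for example (user name, date range, and fields are illustrative):

# Per-job records for a user over a date range
sacct -u alice -S 2024-01-01 -E 2024-01-31 \
      --format=JobID,JobName,Partition,State,Elapsed,MaxRSS,ExitCode

# Aggregate cluster utilization report for the same period
sreport cluster Utilization Start=2024-01-01 End=2024-01-31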

In summary, SLURM’s accounting provides a rich source of cluster activity data. Not all of it is ideal for time-series metrics due to volume, but key aggregates (job counts by state, resource utilization, wait times) are. The integration of SLURM and Prometheus (through the exporter) focuses on exposing just such aggregated metrics to avoid the deluge. For example, the “Status of the Jobs” metrics in the exporter give a snapshot of how many jobs are in each state (Pending, Running, Suspended, Completed, etc.) (GitHub - vpenso/prometheus-slurm-exporter: Prometheus exporter for performance metrics from Slurm.), and even categorized pending jobs by reason (dependencies, etc., possibly). This complements what the accounting DB would tell you after the fact by showing real-time queue status. An admin can glance at Grafana and see, for instance, 2000 jobs pending (1000 of them waiting for a dependency, 1000 for resources), 500 running, 2 nodes down, scheduler is keeping up (short cycle times), etc. All these are crucial for managing an HPC facility effectively.

5. Integration & Co-Deployment of Prometheus and SLURM

Prometheus + SLURM: Metrics Exporter and Grafana Dashboards

By integrating Prometheus monitoring with a SLURM-managed cluster, administrators can get real-time visibility into both the infrastructure and the workload scheduler. The primary integration point is the SLURM exporter for Prometheus. This is a standalone service (typically a Go or Python daemon) that queries SLURM (via commands or the Slurm REST API) and exposes metrics for Prometheus. One such exporter is available on GitHub (Prometheus exporter for performance metrics from Slurm. - GitHub) and is also referenced in Grafana dashboards (SLURM Dashboard | Grafana Labs ).

SLURM Exporter Metrics: The SLURM Prometheus exporter typically provides metrics on: cluster resource usage, job queue stats, and scheduler performance. For example, it exposes:

CPU and GPU state: counts of allocated, idle, and other/unavailable CPUs (and GPUs) across the cluster.

Node state: how many nodes are allocated, idle, mixed, draining, or down.

Job queue status: numbers of jobs pending, running, suspended, completed, and failed, with breakdowns by account and user.

Scheduler statistics: data from sdiag such as scheduler cycle times, queue depth, and the number of backfilled jobs.

Fair-share information: per-account share and usage values.

Grafana Dashboards: Grafana is commonly used to visualize Prometheus data. There are community dashboards for SLURM (such as Grafana Labs dashboard ID 4323 (SLURM Dashboard | Grafana Labs )). These dashboards typically have one row per metric group exposed by the exporter.

The dashboard description lists the metric groups displayed: State of CPUs/GPUs, State of Nodes, Status of Jobs (with breakdown by Account/User), Scheduler Info, Share Info (SLURM Dashboard | Grafana Labs ). Essentially, exactly those provided by the exporter.

By deploying this, an HPC ops team gets a live view rather than relying solely on command-line tools. For example, instead of running sinfo and eyeballing, they have a time history of how many nodes were idle over the last week, or how backlog grew during a big submission burst. This can inform if they need more hardware or if scheduling parameters need tuning (like if many nodes idle but jobs pending, maybe scheduling fragmentation issues or job sizes that don't fit current nodes).

Setting up the exporter usually means running it on the SLURM controller node (where it can query slurmctld or run commands like sinfo, squeue). It then listens on a port (say 8080) for Prometheus to scrape. The Prometheus scrape config for it might look like:

scrape_configs:
- job_name: slurm
  static_configs:
    - targets: ['slurm-master.example.com:8080']
  scrape_interval: 30s
  scrape_timeout: 30s

The exporter documentation even recommends a 30s scrape interval and timeout to avoid overloading slurmctld (GitHub - vpenso/prometheus-slurm-exporter: Prometheus exporter for performance metrics from Slurm.), since some of these queries (like squeue for all jobs) can be heavy. A 30s interval strikes a balance between freshness and load.

In addition to cluster-wide dashboards, one could integrate with Grafana’s ad-hoc filters or variables to drill into specific users or partitions. E.g., a variable to select an account and then show how many jobs that account has had over time.

Alerts with Prometheus: With SLURM metrics in Prometheus, we can set up alerts to catch conditions such as: nodes going down or being drained unexpectedly; many nodes sitting idle while a long queue of jobs is pending; a spike in failed jobs; the pending queue (or the age of the oldest pending job) growing beyond a threshold; and scheduler cycle times rising, indicating slurmctld is struggling. A couple of these are sketched as alerting rules below.
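
A sketch of two such rules. The metric names follow the vpenso slurm-exporter naming but should be verified against the exporter version you actually deploy; thresholds are illustrative:

groups:
- name: slurm-alerts
  rules:
  - alert: SlurmNodesDown
    expr: slurm_nodes_down > 0            # exporter metric name, verify locally
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }} SLURM nodes reported down for 10+ minutes"
  - alert: SlurmIdleNodesWithBacklog
    # Nodes sit idle while a large queue of jobs is pending
    expr: slurm_nodes_idle > 10 and slurm_queue_pending > 500
    for: 30m
    labels:
      severity: warning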

These alerts help operators be proactive. For example, an alert for idle nodes while queue is long could catch if a partition was mistakenly left in a drained state.

Alerting on Failed Jobs, Idle Nodes, and Resource Saturation

Let’s consider some concrete alert scenarios combining Prometheus and SLURM knowledge:

Failed jobs: a sudden spike in jobs ending in FAILED or NODE_FAIL states (beyond the normal background of user errors) often points to a bad node, a full filesystem, or a scheduler problem, and is worth an alert.

Idle nodes with a backlog: nodes sitting idle or drained while hundreds of jobs are pending suggests a partition misconfiguration, fragmentation, or nodes stuck in a drained state.

Resource saturation: cluster-wide allocated CPUs or GPUs near 100% for a sustained period, or node exporter metrics showing memory or filesystem pressure on shared nodes, indicates the cluster is saturated and queue times will grow.

Defining these alert rules in Prometheus and routing them through Alertmanager completes the integration: you get pager or email alerts for HPC cluster conditions just as you would for a web service.

Metric-Log Correlation (Prometheus with Loki/Elasticsearch)

Metrics give numeric overviews, but often we need to drill into logs for details – for instance, if an alert “node down” fires, one will want to see that node’s logs around the failure. By correlating Prometheus metrics with logs (via Loki or Elastic), operators can pivot from an alert to relevant logs easily.

Grafana Loki is a log aggregation system that integrates well with Prometheus. It stores logs with labels (similar to Prom labels) and allows queries for log streams. A common practice is to use consistent labels for metrics and logs, so that Grafana’s “Explore” can link them. For example, label logs with hostname and metrics also have instance or nodename. If an alert comes in “node foo is down”, one can filter logs in Loki for {hostname="foo"} and time range around the alert to see what happened (maybe hardware error messages, kernel panic, etc.). Grafana even supports linking an exemplar trace or logs to a metric datapoint if you set it up.

Elasticsearch/Kibana similarly could be used – for instance, system logs (syslog, dmesg, slurmctld logs) shipped to Elastic. If an alert “Job failure spike” occurs, you’d search the SLURM controller log for error lines (e.g., node communication errors).

To make correlation smoother, some integrate via Grafana’s explore: you can have a graph of “Failed jobs” in Grafana, click on a data point spike, and jump to logs at that timestamp with the relevant context. Loki can use the same label (like job ID or node) if those are included in log messages.

For HPC, interesting logs include:

slurmctld logs on the controller (scheduling decisions, node state changes, job submission/completion records, communication errors).

slurmd logs on each compute node (job launch and termination, cgroup/OOM kills, prolog/epilog failures).

System logs and dmesg on compute nodes (hardware errors, kernel panics, OOM killer activity, filesystem or network faults).

Job stdout/stderr files, which usually hold the application-level reason a job failed.

If using something like Elastic Stack, one could index events like job completion as structured logs (via Slurm accounting or a DB hook). But that might duplicate metrics. Instead, focus on things metrics can’t tell: cause of failure, performance messages.

One cool approach is linking trace IDs: for microservices, instrumented metrics can carry exemplars (for example a Jaeger trace ID) that Grafana can link to the corresponding trace. In HPC this is not typically applicable unless you use distributed tracing for, say, workflow steps.

But a simpler integration: include links to Kibana/Loki in your alert annotations, which Alertmanager then renders into notifications. E.g., an alert for NodeDown could include a templated link: “See logs: http://lokiserver/grafana/explore?left=...&expr={hostname="{{ $labels.node }}"}”. This way, when an alert is received, the on-call can click the link to jump to logs of that node.

Prometheus and Loki use-case: Suppose an alert “Scheduler Cycle Time > 30s” triggers (meaning slurmctld is having trouble). The admin can check logs on the controller via Loki to find if any specific job or event is causing slurmctld issues (maybe an error like “error communicating with node X” repeating). The metrics indicated a problem, the logs detail it.

Another example: An alert “High Job Failures” triggers. The alert provides maybe a count and perhaps a sample job ID that failed. The admin can search in slurmctld log for that job ID to see why it failed (slurmctld log records job end with exit code or failure reason). Or search node logs for that job’s execution node to see a segfault or OOM message.

In integration practice, one ensures that job IDs, node names, user IDs, etc., appear in both metrics and logs. SLURM logs do contain job IDs and node names. The exporter metrics label things by user, account, partition. So you can go from “user X has lots of failing jobs” (metrics) to a log query filtered on that user (slurmctld usually logs the submitting user when a job starts and finishes).

Thus, metrics give the “what” and logs give the “why.” A tight integration means in Grafana, within the same interface, you can select “Explore” on a panel and switch to logs mode, carrying over relevant labels/time. This greatly speeds up diagnosis compared to manually opening a separate Kibana and copying times.

Hybrid Clusters: SLURM and Kubernetes Monitoring

Increasingly, organizations may run both HPC workloads via SLURM and containerized workloads via Kubernetes (K8s), sometimes on the same hardware or at least in the same datacenter. This introduces complexity in monitoring because you effectively have two orchestrators. Some strategies and concerns:

Run the node exporter everywhere so that low-level host metrics are comparable across the HPC and Kubernetes fleets.

Scrape each orchestrator’s own metrics (the SLURM exporter on one side; kube-state-metrics, kubelet/cAdvisor, and the API server on the other) from the same Prometheus or from federated Prometheus servers.

Use consistent labels (e.g., a cluster or role label) so dashboards and alerts can be filtered or unified across both worlds.

Watch for contention if the two schedulers share hardware, since neither is aware of the other’s allocations.

From a Prometheus config perspective, you might have multiple scrape jobs: one for node exporter on all nodes, one for slurm exporter, one for kube-state-metrics, one for cadvisor (which might be via the kubelet or through Prom operator). They can all live in one Prom server. Use recording rules to combine if needed (like sum(node_cpu_seconds_total{mode!="idle", cluster="hpc"}) to see total CPU usage across all nodes, whether by K8s or HPC jobs – though that just shows OS usage, not attribution to SLURM vs K8s).
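
A sketch of such a combined scrape configuration, with an external label distinguishing this Prometheus from others (addresses are placeholders; in a real Kubernetes setup the kubelet/cAdvisor and kube-state-metrics targets would usually come from kubernetes_sd_configs rather than static lists):

global:
  external_labels:
    cluster: hpc-datacenter-1
scrape_configs:
  - job_name: node            # node exporter on every node, HPC and K8s alike
    static_configs:
      - targets: ['cn001:9100', 'cn002:9100', 'k8s-worker1:9100']
  - job_name: slurm
    static_configs:
      - targets: ['slurm-master.example.com:8080']
  - job_name: kube-state-metrics
    static_configs:
      - targets: ['kube-state-metrics.kube-system.svc:8080']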

A specific integration example: some sites run Kubeflow or Argo Workflows on top of Kubernetes for machine learning, but use SLURM for big training jobs. They want a unified view of “all ML tasks running”. They could label their Prom metrics such that they can filter “show me only ML tasks on HPC vs on K8s”.

Finally, for maintainers, a hybrid cluster means two schedulers to monitor. Ensuring both are healthy is critical. Prometheus could alert on both slurmctld issues and on Kubernetes API server issues, etc. Having them in one system prevents missing an issue by only looking at one side.

In short, co-deploying Prometheus with both HPC and cloud orchestrators gives a single pane of glass for admins. They can correlate events between systems. For example, maybe a K8s batch job triggers a SLURM job through some bridge – if that pipeline stalls, one can see if either side had a problem (K8s part success metric vs slurm job pending metrics). Such holistic monitoring is increasingly valuable as HPC and cloud converge in many environments.

Prometheus at Scale: Cardinality, Long-Term Storage (Thanos/Cortex), OpenTelemetry

As monitoring needs grow (in terms of metrics count, retention, and integration with tracing), new challenges and solutions arise:

Cardinality growth: more services, nodes, and labels mean more active series, so the design discipline from Section 3 matters even more at scale.

Long-term storage and global view: a single Prometheus keeps limited local retention, so systems like Thanos and Cortex/Mimir add durable object-storage-backed retention, deduplication across replicas, and a single query endpoint over many Prometheus servers, typically fed via remote_write or a sidecar (see the sketch below).

OpenTelemetry: OTel is emerging as a vendor-neutral standard for metrics, traces, and logs; newer Prometheus versions can ingest OTLP metrics, and OTel collectors can scrape or forward Prometheus metrics, easing integration between the two ecosystems.
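
For the long-term-storage piece, the usual pattern is remote_write from each Prometheus to a backend such as Thanos Receive, Cortex, or Mimir; a minimal sketch (the URL and queue setting are placeholders):

remote_write:
  - url: https://metrics-gateway.example.com/api/v1/push
    queue_config:
      max_samples_per_send: 5000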

SLURM Futures: Containers, ML Workloads, and Workflow Integration

The HPC scheduling world is also evolving to address new demands:

Containers: SLURM can launch containerized jobs (e.g., via container plugins or runtimes such as Singularity/Apptainer), giving users reproducible software environments without giving up batch scheduling.

ML workloads: GPU-heavy training jobs push demand for fine-grained GPU scheduling, gres accounting, and better visibility into GPU utilization per job.

Workflow integration: workflow engines and pipelines increasingly submit to SLURM programmatically (often through the REST API), blurring the line between batch scheduling and service orchestration.

Cloud bursting and dynamic nodes: clusters can grow and shrink by powering nodes up and down or provisioning cloud instances on demand, which the scheduler must account for.

In conclusion, the landscape is converging: HPC workload managers like SLURM are adopting features from cloud orchestrators (dynamic scaling, container support), while cloud systems are learning from HPC (e.g., batch scheduling for GPU jobs). Observability needs to keep up: it’s likely we’ll see unified monitoring frameworks (maybe OTel-based) that cover both HPC and cloud apps. Prometheus and its ecosystem are actively evolving in this direction: integration with OpenTelemetry, scaling out via Thanos/Cortex, etc., to remain the backbone of monitoring in these hybrid environments.

Finally, academic research continues on scheduling algorithms (e.g., energy-aware scheduling, or scheduling to minimize cloud cost in hybrid clusters). Implementing these could involve new metrics – like power consumption per job, or cost tracking. Indeed, if jobs can run on-prem or cloud, a scheduler might pick the cheapest option given a time constraint. Then one would monitor budget usage and efficiency.

By staying aware of these trends – containerization, hybrid cloud, advanced scheduling – early-career systems engineers can design monitoring and management solutions that are future-proof and adaptable. Both Prometheus and SLURM are mature but actively developed projects, so new features (like native histograms in Prometheus or Slurm REST API improvements or new exporter metrics) will continue to appear, enhancing what we can observe and automate in large-scale distributed systems.