Distributed Systems Observability and HPC Workload Management with Prometheus and SLURM (DR)
1. Core Concepts
Monitoring Systems vs. Workload Managers
Monitoring System (Prometheus): A monitoring system continuously collects and stores metrics (measurements) from running services and infrastructure, providing visibility into their performance and health (Putting queues in front of Prometheus for reliability - Robust Perception | Prometheus Monitoring Experts). Prometheus is a prime example: it is a time-series monitoring system that gathers numeric metrics (CPU usage, request rates, etc.) over time, stores them, and enables querying and alerting on those metrics (Putting queues in front of Prometheus for reliability - Robust Perception | Prometheus Monitoring Experts). Monitoring systems do not execute user workloads; instead, they observe and report on the state of those workloads and the underlying resources.
Workload Manager (SLURM): A workload manager (or batch scheduler) is responsible for scheduling and executing jobs (user-submitted tasks) on a cluster of compute nodes (Slurm Workload Manager - Overview). SLURM (Simple Linux Utility for Resource Management) is an HPC workload manager that allocates resources (nodes/CPUs/GPUs) to jobs, queues jobs until resources are available, and dispatches jobs to run on the cluster (Slurm Workload Manager - Overview). In simple terms, if Prometheus tells you what is happening on your systems, SLURM is one of the systems that makes things happen by running users' computational jobs. SLURM handles resource arbitration, enforcing policies like job priorities and fairness, whereas Prometheus handles observation, helping administrators detect issues (like high CPU, failed jobs, etc.). Both are crucial in a large cluster: SLURM keeps the machines busy with jobs, and Prometheus ensures you have insight into how those jobs and machines are performing.
Push-Based vs. Pull-Based Metric Collection
Pull Model (Prometheus): Prometheus uses a pull model for metrics - the Prometheus server periodically scrapes each target (application or exporter) by making HTTP requests to fetch the current metric values (Pull doesn't scale - or does it? | Prometheus) (Why is Prometheus Pull-Based? - DEV Community). In this model, the monitoring system initiates the connection. This has several advantages: the monitoring server knows if a target is down (the scrape fails), it can easily run multiple redundant servers, and one can manually inspect a target's metrics by visiting its endpoint (Why is Prometheus Pull-Based? - DEV Community). For example, Prometheus can scrape a node exporter on each host every 15 seconds; if a host is unreachable, Prometheus flags it as down immediately. Pull mode also avoids the risk of flooding the server with data - targets emit metrics only when scraped. Prometheus's designers found that pulling metrics is "slightly better" for these reasons (Why is Prometheus Pull-Based? - DEV Community), although the difference from push is minor for most cases. Prometheus can handle tens of thousands of targets scraping in parallel, and the real bottleneck is typically processing the metric data, not initiating HTTP connections (Pull doesn't scale - or does it? | Prometheus).
Push Model: In a push-based system, the monitored applications send (push) their metrics to a central collector (often via UDP or a gateway). Systems like StatsD, Graphite, or CloudWatch use this model. Push can be useful for short-lived batch jobs or events that can't be scraped in time. Prometheus accommodates these via an intermediary Pushgateway for ephemeral jobs that terminate too quickly to be scraped (The Architecture of Prometheus. This article explains the Architecture... | by Ju | DevOps.dev). However, a pure push model requires each application to know where to send data and can overwhelm the collector if misconfigured. Pull systems inherently get a built-in heartbeat (no scrape = problem) (Why is Prometheus Pull-Based? - DEV Community), whereas push systems often need separate health checks. In practice, many monitoring setups use a hybrid: Prometheus mostly pulls, but can pull from a Pushgateway where other apps have pushed their metrics.
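To make the push path concrete, here is a minimal sketch using the Python prometheus_client library; the gateway address and job name are placeholders, not values from any referenced setup.

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

# Ephemeral batch job: record a completion timestamp and push it before exiting.
registry = CollectorRegistry()
last_success = Gauge('batch_job_last_success_timestamp_seconds',
                     'Unix time when the batch job last finished successfully',
                     registry=registry)
last_success.set_to_current_time()

# The Pushgateway holds the sample; Prometheus then pulls it on its normal scrape schedule.
push_to_gateway('pushgateway.example.com:9091', job='nightly_cleanup', registry=registry)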
Time-Series Data, Labels, and Alerting Semantics
Modern monitoring metrics are stored as time-series: streams of timestamped values for each metric (Data model | Prometheus). Prometheus's data model is multi-dimensional: each time-series is identified by a metric name and a set of key-value labels (dimensions) (Data model | Prometheus). For example, node_cpu_seconds_total{mode="idle", instance="node1"} is a counter of idle CPU seconds for instance node1. Labels allow slicing and dicing metrics (e.g., aggregate CPU by mode or host) and are fundamental to Prometheus's query power (Data model | Prometheus). This label-based approach contrasts with older systems that used rigid metric naming conventions - Prometheus's flexible labels enable powerful ad-hoc queries over dimensions like service, datacenter, instance, etc. However, high-cardinality labels (like per-user IDs) should be avoided, as each unique label combination produces a new time-series, exploding the data stored (Metric and label naming | Prometheus).
Prometheus stores recent time-series data locally in its TSDB (time-series database) and uses efficient techniques (WAL and compaction) to manage it. New samples append to an in-memory buffer and a write-ahead log (WAL) on disk for durability (Storage | Prometheus). Periodically, samples are compacted into longer-term storage files (blocks), e.g., 2-hour blocks combined into larger blocks up to the retention period (Storage | Prometheus). This design makes writes fast and recoverable (WAL replay on restart) while keeping older data in compressed form. It is a single-node datastore by design - Prometheus doesn't cluster its storage (external systems like Thanos/Cortex are used for that), favoring the simplicity and reliability of one node (Overview | Prometheus).
Alerting: On top of metric collection, a monitoring system provides alerting semantics. In Prometheus, alerts are defined by alerting rules, which continuously evaluate PromQL expressions and fire when conditions are met (Alerting rules | Prometheus). For instance, an alert rule might check if CPU usage has been above 90% for 10 minutes. When the rule condition is true, Prometheus marks the alert "firing" and sends it to the Alertmanager, an external component that de-duplicates alerts, applies silences/inhibitions, and routes notifications (email, PagerDuty, etc.). An alert is essentially a special time-series that becomes active when some metric condition holds (Alerting rules | Prometheus). This approach means alert logic is version-controlled and reproducible (just queries with thresholds), rather than hidden in code. Prometheus's alerting emphasizes being timely and reliable (it may drop some metric samples under load, but tries hard not to miss firing an alert when an outage happens) (Putting queues in front of Prometheus for reliability - Robust Perception | Prometheus Monitoring Experts). In summary, monitoring systems track metrics over time and raise alerts on abnormal conditions, but they don't remediate problems themselves; that is left for humans or automation systems.
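Alert state can also be inspected programmatically: the Prometheus server exposes its currently pending and firing alerts over the HTTP API. A small sketch with Python's requests library (the server address is an assumption):

import requests

# List alerts the server currently considers pending or firing.
resp = requests.get('http://prometheus.example.com:9090/api/v1/alerts', timeout=5)
resp.raise_for_status()
for alert in resp.json()['data']['alerts']:
    name = alert['labels'].get('alertname', '<unnamed>')
    print(f"{name}: state={alert['state']} since {alert.get('activeAt', 'n/a')}")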
Batch vs. Interactive Jobs, Queues, and Scheduling Policies in HPC
In HPC environments managed by a scheduler like SLURM, users typically run batch jobs. A batch job is a non-interactive workload submitted to a queue, often via a script with resource requirements (e.g., 16 cores for 2 hours). The job waits in a queue until the scheduler allocates the requested resources, then runs to completion. This is in contrast to interactive usage (like running commands on a login node or using an interactive allocation). HPC systems separate these for efficiency and fairness: users submit work to the scheduler instead of directly logging into compute nodes (Scheduling Basics - HPC Wiki). Batch jobs ensure the cluster runs at high utilization with managed scheduling, whereas direct interactive runs could conflict and overload resources.
Interactive jobs in SLURM can be run with commands like srun or salloc, which grant a shell or run a command on allocated nodes in real time. These are useful for debugging or running interactive applications (e.g., an interactive Jupyter notebook on compute nodes). They still go through SLURM's allocation mechanism (just with immediate execution after allocation). In practice, HPC centers have login nodes for compiling and prepping, but computation happens via batch or interactive jobs on compute nodes that are otherwise not directly accessible (Scheduling Basics - HPC Wiki).
Scheduling Policies: SLURM uses a combination of scheduling policies to decide which job to run from the queue when resources free up. By default it is FIFO (first-come, first-served) modified by priority factors. Administrators configure a multifactor priority plugin that gives each job a priority score based on factors like waiting time (age), job size, user fair-share, quality of service (QoS), etc. (Slurm Workload Manager - Overview). For fairness, SLURM supports fair-share scheduling, where users or projects are assigned shares of the cluster, and if someone has used more than their share recently, their new jobs get lower priority (Slurm Workload Manager - Classic Fairshare Algorithm). This prevents one user from monopolizing the cluster - over time the scheduler "banks" usage and biases in favor of those who have run less.
Another important HPC policy is backfill scheduling. Backfilling allows smaller jobs to run out of order to avoid wasting idle resources, so long as they don't delay the start of a higher-priority (often larger) job at the top of the queue (Scheduling Basics - HPC Wiki). The scheduler will reserve nodes for a big job that can't start yet (perhaps waiting for more nodes to free up), but in the meantime will backfill shorter jobs into the gaps. This significantly improves utilization: short jobs experience very short queue times since they can slip in, and the large job still starts as soon as its reserved resources become free (Scheduling Basics - HPC Wiki). SLURM's scheduler runs a backfill cycle periodically to find these opportunities.
SLURM also supports preemption (high-priority jobs can force lower ones off; for example, an urgent job preempts a running job, which might be requeued or canceled) and partitions (distinct sets of nodes with their own job queues and limits, analogous to multiple job queues) to enforce different policies. For instance, an "interactive" partition might allow only short jobs for quick turnaround, or a "gpu" partition ensures only GPU-equipped nodes are used for GPU jobs. Priority/fairness settings along with partitions and QoS give administrators a rich toolset to enforce organizational policies (e.g., research groups get equal shares, or certain jobs always have higher priority). The key is that HPC workload managers balance throughput, utilization, and fair access (Scheduling Basics - HPC Wiki): the scheduler tries to minimize wait time, maximize node usage (CPUs/GPUs not sitting idle), and ensure no single user or project unfairly hogs the cluster.
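These policies are what a job submission actually negotiates with: partition, QoS, task count, and time limit all travel with the request. A hedged sketch that submits a job from Python via subprocess; the partition and QoS names are made up for illustration.

import subprocess

# Submit a 30-minute, 4-task job to a hypothetical "short" partition under a "normal" QoS.
cmd = [
    'sbatch', '--parsable',            # --parsable makes sbatch print just the job ID
    '--partition=short', '--qos=normal',
    '--ntasks=4', '--time=00:30:00',
    '--wrap=srun ./my_simulation',     # wrap a one-line command instead of a batch script
]
job_id = subprocess.run(cmd, check=True, capture_output=True, text=True).stdout.strip()
print(f'Submitted batch job {job_id}')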
Control Plane vs. Data Plane in Cluster Systems
In distributed systems, we distinguish the control plane, which makes decisions and orchestrates, from the data plane, which carries out the actual work (Control Plane vs. Data Plane: What's the Difference? | Kong Inc.). In our context:
Prometheus's control plane is the Prometheus server: it decides when and where to scrape metrics, evaluates rules, and sends alerts. The data plane of monitoring is the instrumented applications and exporters (the targets exposing metrics). They generate and serve metric data but don't decide when to send it. In a pull model, the Prometheus server is firmly in control: it enforces the policy "scrape every 30s" and knows who to scrape. The targets simply carry out the action of collecting metrics when asked. This separation improves reliability - you can restart Prometheus (control plane) without affecting the applications (data plane), and vice versa (Overview | Prometheus).
SLURM's control plane is the central scheduler daemon, slurmctld (and associated components like schedulers and controllers). slurmctld makes all the decisions: which jobs to run, on which nodes, when to preempt, etc. (Slurm Workload Manager - Overview). The data plane in SLURM is the compute nodes and their local daemons (slurmd) that actually execute the jobs (Slurm Workload Manager - Overview). Slurmctld issues orders (launch job X on node Y), and the slurmd on each node carries them out by spawning processes, enforcing resource limits, and reporting back status. If we draw an analogy, slurmctld is like an air traffic controller (control plane), and the slurmd daemons plus user jobs are the airplanes and runways (data plane). The control plane handles the policies - enforcing scheduling algorithms, managing the job queue - while the data plane (compute nodes) performs the computations. This separation is critical for scalability and fault tolerance. For example, SLURM can run with a backup controller: if the primary slurmctld fails (a control plane failure), the backup takes over, while the running jobs on compute nodes are largely unaffected and continue (the data plane keeps operating) (Slurm Workload Manager - Overview). The control plane vs. data plane concept also appears in Kubernetes and other cluster managers: the control plane (the Kubernetes API server and scheduler) decides where containers run, and the data plane (the kubelet on each node and the containers themselves) does the actual work. In summary, the control plane establishes and enforces policy, while the data plane carries out the tasks according to that policy (Control Plane vs. Data Plane: What's the Difference? | Kong Inc.). Understanding this separation helps in debugging and design: e.g., an issue in slurmctld (control) could stall new job scheduling but wouldn't directly kill running tasks (data); conversely, a node crash (a data plane issue) is handled by the control plane noticing and requeueing that node's jobs.
2. Architectural Overview
Prometheus Architecture
Prometheus follows a simple but powerful single-server architecture encompassing metric collection, storage, querying, and alerting in one cohesive system (The Architecture of Prometheus. This article explains the Architecture... | by Ju | DevOps.dev). The main components of a Prometheus deployment are:
Scraping Engine & Service Discovery: The Prometheus server periodically scrapes configured targets by sending HTTP requests to their /metrics endpoints (The Architecture of Prometheus. This article explains the Architecture... | by Ju | DevOps.dev). Targets can be statically configured or dynamically discovered (through integrations with service discovery systems like Kubernetes, Consul, etc.). Prometheus supports various service discovery mechanisms to automatically find targets (for example, discovering all Kubernetes pods with certain annotations). This scraping engine is highly parallel and efficient, capable of handling tens of thousands of targets (Pull doesn't scale - or does it? | Prometheus). Because it's pull-based, adding a new Prometheus instance (for redundancy or dev testing) is as easy as pointing it at the targets - each will scrape independently without the instrumented apps needing to know about it (Why is Prometheus Pull-Based? - DEV Community).
Time Series Database (TSDB): All scraped samples are stored locally in Prometheus's embedded time-series database. This TSDB is optimized for write-heavy workloads and append-only access patterns. New data points are appended in memory and to the WAL (Write-Ahead Log) on disk for durability (Storage | Prometheus). Background compaction processes merge recent data into longer-term block files (by default, 2-hour blocks, which then get compacted into larger blocks, up to the retention limit) (Storage | Prometheus). The TSDB uses a custom storage format with efficient, Gorilla-style compression (delta-of-delta encoding for timestamps, XOR encoding for sample values). This allows a single Prometheus server to store millions of series and handle high ingestion rates (hundreds of thousands of samples per second in real-world deployments) (Pull doesn't scale - or does it? | Prometheus). One trade-off: the Prometheus TSDB is not clustered - each server is standalone (no automatic replication). For durability beyond one node, you rely on backing up data or using remote storage integrations. However, this design avoids complex distributed consensus and makes each Prometheus instance simpler and more robust (it's often possible to run multiple identical Prometheus servers for HA without them needing to coordinate, and to use tools like Thanos to deduplicate data externally) (Overview | Prometheus).
PromQL Query Engine: Prometheus provides a powerful query language called PromQL for on-the-fly computations and data retrieval. The PromQL engine allows selecting time-series by label filters and applying aggregations, functions, and arithmetic. For example, one can query the average non-idle CPU usage per node with:
avg(rate(node_cpu_seconds_total{mode!="idle"}[5m])) by (instance)
This query calculates a 5-minute per-second rate of non-idle CPU time for each CPU/mode series and then averages those rates per instance (so essentially, average non-idle CPU per node). The engine supports many functions like rate() (change per second), avg_over_time, max, joins between metrics, etc. PromQL is designed to leverage the multi-dimensional data - you can aggregate along any label. It executes queries on data in the TSDB (and can perform subqueries to first compute an intermediate time-series over a range (PromLabs | PromQL Cheat Sheet)). For example, a subquery might compute a rolling 5m rate of a metric over the past hour: rate(http_requests_total[5m])[1h:1m] (meaning "for each 1m interval in the last hour, compute the 5m rate") (PromLabs | PromQL Cheat Sheet). This is useful for smoothing and advanced alert conditions. The query engine is also used for recording rules (which precompute time-series) and alerting rules.
Alertmanager: While the Prometheus server detects alert conditions, it offloads the notification and escalation logic to the Alertmanager (a separate service). Prometheus servers send firing alerts to Alertmanager, which then groups related alerts, suppresses noisy ones (for example, if many host-down alerts fire, it might group them into one notification), and routes them to channels like email, Slack, PagerDuty, etc. The Alertmanager supports silencing (e.g., silence alerts during maintenance) and inhibition (don't alert on dependent systems if a core system is down). In a typical setup, one runs a small cluster of Alertmanager instances for redundancy. This decoupling allows Prometheus to focus on metric scraping/storage, while Alertmanager handles the policy of whom to notify and how. Together, they provide a complete monitoring/alerting solution: Prometheus evaluates what is wrong, Alertmanager decides who should know and how to inform them (Alerting overview - Prometheus).
Exporters and Targets: Not part of the Prometheus core, but worth mentioning: most systems expose metrics via exporters or instrumentation libraries. For instance, the Node Exporter exposes Linux host metrics (CPU, memory, disk) for Prometheus to scrape (Monitoring Linux host metrics with the Node Exporter | Prometheus). There are exporters for databases (MySQL, PostgreSQL), hardware (network switches via SNMP), and more. Applications can also link Prometheus client libraries (for Go, Java, Python, etc.) to expose custom metrics. These components live on the data plane, as discussed, but are integral to the overall architecture - they are the eyes and ears feeding Prometheus data.
A simplified view of Prometheus's architecture is: the Prometheus server scrapes metrics from instrumented jobs (or exporters), stores those samples locally, then runs rules on that data to produce aggregated series or trigger alerts (Overview | Prometheus). Users can query the data via PromQL (directly via the API or through Grafana dashboards). Everything is designed to be self-contained and reliable on a single node (no external dependencies for core operation) (Overview | Prometheus). For scalability, you run multiple Prometheus servers (e.g., sharding by metric kind or environment) or use federation/remote write to hierarchical systems, but each one keeps the same simple internal architecture. Figure 1 illustrates Prometheus's components and ecosystem at a high level (Prometheus: Monitoring at SoundCloud | SoundCloud Backstage Blog), with optional components like Pushgateway and Grafana as external integrations.
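The PromQL engine is also reachable through a plain HTTP API, which makes ad-hoc checks easy to script. A sketch that runs the CPU query above as an instant query, assuming a Prometheus server on localhost:9090:

import requests

query = 'avg(rate(node_cpu_seconds_total{mode!="idle"}[5m])) by (instance)'
resp = requests.get('http://localhost:9090/api/v1/query',
                    params={'query': query}, timeout=10)
resp.raise_for_status()
for series in resp.json()['data']['result']:
    instance = series['metric'].get('instance', '<unknown>')
    _timestamp, value = series['value']      # instant query: one (timestamp, value) pair per series
    print(f'{instance}: {float(value):.3f} average non-idle CPU (cores)')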
SLURM Architecture
SLURM is built as a distributed, modular system with a central brain and per-node agents that actually execute jobs (Slurm Workload Manager - Overview). The key components of SLURM's architecture include:
slurmctld (Control Daemon): This is the central controller that runs on the management node (or a pair for failover). slurmctld maintains the global state of the cluster: it knows about all nodes, all queued jobs, user accounts, partition configurations, etc. (Slurm Workload Manager - Overview). It is responsible for scheduling decisions - when a compute node becomes free, slurmctld picks the highest-priority pending job that fits and allocates the node to that job. It also manages job submission and cancellation and handles messages from nodes. There can be a backup slurmctld to take over if the primary fails, supporting HA in the control plane (Slurm Workload Manager - Overview).
slurmd (Node Daemon): This daemon runs on every compute node (the "worker" nodes of the cluster) (Slurm Workload Manager - Overview). slurmd waits for instructions from slurmctld to start or stop jobs on that node. When a user job is scheduled to a node, slurmctld communicates with that node's slurmd, which then launches the job's processes (via a unified launcher akin to a remote shell) (Slurm Workload Manager - Overview). Slurmd sets up the job's cgroup for resource limits, executes the task, monitors its progress, and reports back exit status or any issues. The slurmd daemons form a communication hierarchy that is fault-tolerant - for example, they can forward messages if direct contact with slurmctld is lost, and perform some host-level error recovery (marking themselves down, etc.) (Slurm Workload Manager - Overview). In essence, slurmctld is the master scheduler, and each slurmd is the executor on a node.
slurmdbd (Database Daemon): This optional component handles accounting and historical record-keeping (Slurm Workload Manager - Overview). SLURM can be configured with a central database (typically MySQL or MariaDB) to store data on job submissions, completions, resource usage, user accounts, etc. slurmdbd acts as a broker between slurmctld and the database - it receives accounting logs from the controller and writes them to the DB, and also serves queries (for commands like sacct, which reports on past jobs) (Slurm Workload Manager - Overview). Using slurmdbd is crucial for clusters where you need to track usage for many users over time (for example, to enforce fair share over a past usage window, to generate utilization reports, or to implement quotas/limits per account). One slurmdbd can aggregate info for multiple clusters into one database, enabling multi-cluster reporting.
User Commands and APIs: Users and administrators interact with SLURM via commands (or the REST API, slurmrestd). Important user commands include: sbatch (submit a batch script to the queue), srun (launch a job or job step, possibly interactive), salloc (request an interactive allocation), squeue (view queued/running jobs), sinfo (view node and partition states), scancel (cancel a job), sacct (show accounting info for completed jobs), etc. (Slurm Workload Manager - Overview). Administrators have scontrol (to view or modify cluster state, e.g., set a node down or adjust a job's priority) and sacctmgr (to manage accounts, users, and fairshare settings in the database) (Slurm Workload Manager - Overview). These commands send requests to slurmctld, which is the authority on cluster state. For example, sbatch will contact slurmctld to add a new pending job to the queue, and squeue asks slurmctld for the list of current jobs and their statuses. SLURM also has a Python/C API and the REST API for integration with external tools (a small wrapper sketch follows this list).
Scheduler Plugins: SLURM's scheduling behavior can be customized via plugins. The default scheduler is a simple FIFO queue with backfill (the sched/builtin and sched/backfill plugins in SLURM's plugin system). But SLURM allows sites to choose different scheduling plugins or tune parameters. For instance, enabling the fair-share plugin will cause slurmctld to compute job priority based on user fairshare usage (Slurm Workload Manager - Overview). There are also plugins for gang scheduling (time-slicing on nodes), advanced reservations, and topology-aware scheduling (optimizing job placement based on network topology) (Slurm Workload Manager - Overview). These plugins extend the core functionality by hooking into the control daemon's decision-making. The Multifactor Priority plugin, for example, implements a formula that combines job age, size, fairshare, etc., into a priority value (Slurm Workload Manager - Overview). The backfill scheduler runs as a separate thread that looks for jobs to start early without delaying higher-priority jobs. All of these operate within the slurmctld process as loadable modules, making SLURM very flexible for different HPC site policies.
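Because every one of these commands is just a request to slurmctld, they are straightforward to wrap for monitoring or automation. A small sketch that summarizes the queue by job state using squeue's no-header output (the format string is one reasonable choice, not the only one):

import subprocess
from collections import Counter

# -h suppresses the header; %T is the job state, %P the partition.
out = subprocess.run(['squeue', '-h', '-o', '%T %P'],
                     check=True, capture_output=True, text=True).stdout
states = Counter(line.split()[0] for line in out.splitlines() if line.strip())
print(dict(states))    # e.g. {'RUNNING': 120, 'PENDING': 45}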
Figure 2 below (SLURM Components) shows a high-level diagram of this architecture (Slurm Workload Manager - Overview). There is a central slurmctld (and optional backup), numerous slurmd daemons on each compute node, an optional slurmdbd connected to a database for accounting, and various clients (commands or API users) interacting with slurmctld. The design is purposefully decentralized for scalability: the heavy work of running jobs is distributed to the slurmd daemons, while the central daemon focuses on scheduling logic and coordination. This has been proven to scale to clusters of tens of thousands of nodes by minimizing per-job overhead and using efficient RPCs between controller and nodes (Slurm Workload Manager - Overview). The SLURM controller and nodes exchange messages for job launch, completion, and heartbeats. If a node fails, slurmctld notes it and can requeue the node's jobs or alert the admins. If slurmctld fails, the backup can take over without interrupting running jobs (which slurmd will continue to run and then later report to the new controller).
To summarize, SLURM's architecture consists of: (1) a central brain scheduling jobs and managing state, (2) distributed agents on each node to execute and report on jobs, (3) an optional accounting database for historical data, and (4) a suite of user-facing commands/APIs to interact with the system (Slurm Workload Manager - Overview). This modular design (with plugins for different features) allows SLURM to run everything from small clusters to the world's largest supercomputers by adjusting components and scheduling algorithms as needed.
3. Observability and Monitoring in Depth
Instrumentation Best Practices (Counters, Gauges, Histograms)
To make systems observable with Prometheus, proper instrumentation is essential. Prometheus client libraries expose four main metric types, each with best-practice usage patterns (Instrumentation | Prometheus):
Counter: A counter represents a cumulative value that only increases (or resets to zero on restart). Use counters for things that monotonically increase, such as the number of requests served, tasks completed, or errors occurred. A counter should never be decremented - if you need to track decreases or values that go up and down, that's a gauge. Prometheus expects counters to only increase, so it can derive rates (rate() or increase()) reliably (Instrumentation | Prometheus). Example: http_requests_total is a counter that counts total HTTP requests received. We don't decrease it when a request finishes; it just keeps incrementing for each new request. In queries, one rarely looks at the raw counter (e.g., "we have 1,234,567 requests total" is less useful than "requests per second"). Instead, you'd use PromQL's rate() or increase() on counters to get rates over time (Instrumentation | Prometheus). Rule of thumb: if the value can go down, it's not a counter (Instrumentation | Prometheus).
Gauge: A gauge is a metric that can arbitrarily go up and down, representing the current value of some quantity. Use gauges for measurements of a state at a point in time - e.g., current memory usage, temperature, queue length, number of active threads. Gauges are straightforward: you set them to the latest value whenever it changes. They are like an instrument gauge (a speedometer) - they reflect the current reading. Example: node_memory_MemFree_bytes is a gauge indicating how many bytes of RAM are free on a host right now. Gauges can be aggregated (you might sum memory across nodes) or used in alert comparisons directly (if free memory < X). Important: never take a rate() of a gauge - since it can go down, the concept of "per-second change" may be meaningless or even negative (which rate() isn't meant for) (Instrumentation | Prometheus).
Histogram: A histogram tracks the distribution of a set of observations (like request durations or message sizes) by bucketizing them. In Prometheus, a histogram is implemented as a set of counters - for each predefined bucket, it counts how many observations fell into that bucket, and also provides a total count and sum of all observations (Histograms and summaries | Prometheus). For instance, you might have a histogram metric http_request_duration_seconds with buckets like 0.1s, 0.5s, 1s, 5s, and so on. The Prometheus client will count how many requests took <=0.1s, <=0.5s, etc. This allows calculating quantiles (like 95th-percentile latency) on the server side using the histogram data. Best practice: choose buckets wisely for the latency/size distribution you expect, and prefer relatively broad buckets if you will aggregate across many instances (since aggregating histograms is easy - you can sum the counts in each bucket from multiple instances). PromQL provides a function histogram_quantile() to compute approximate quantiles from histogram buckets (Histograms and summaries | Prometheus). Histograms are powerful for aggregatable latency/SLO tracking, but note they create multiple time-series (one per bucket, plus sum and count), which can add up. As guidance, keep the number of buckets reasonable (e.g., dozens, not hundreds) and use them when you need the distribution, not just an average.
Summary: A summary is similar to a histogram in that it tracks a distribution (count, sum, and configurable quantile estimates), but the quantile computation is done client-side. The summary type will report pre-computed quantiles (median, 95th, etc.) in the scraped data. However, summaries have a major limitation: you cannot aggregate quantiles from multiple instances (the quantile on each instance doesn't combine into a global quantile). Histograms, on the other hand, can be aggregated (by summing bucket counts) and then quantiles can be calculated from the merged distribution (Histograms and summaries | Prometheus). Generally, summaries are recommended only in specific cases where you can't pre-define meaningful buckets or you only need per-instance (or single-job) latency without cross-instance aggregation. Many sites avoid summaries due to the aggregation issue and use histograms for anything that might need cluster-wide or service-wide latency analysis (Histograms and summaries | Prometheus). (A short instrumentation sketch in Python follows this list.)
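To tie these types together, here is a minimal, hypothetical HTTP-service instrumentation sketch with the Python prometheus_client library; the metric names, bucket boundaries, and port are illustrative choices, not prescriptions.

import random, time
from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS = Counter('http_requests_total', 'Total HTTP requests handled', ['status'])
IN_FLIGHT = Gauge('http_requests_in_flight', 'Requests currently being served')
LATENCY = Histogram('http_request_duration_seconds', 'Request latency in seconds',
                    buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0))

def handle_request():
    IN_FLIGHT.inc()                                   # gauge goes up...
    start = time.time()
    try:
        time.sleep(random.uniform(0.01, 0.3))         # stand-in for real work
        REQUESTS.labels(status='200').inc()           # counter only ever increases
    finally:
        LATENCY.observe(time.time() - start)          # lands in one of the buckets above
        IN_FLIGHT.dec()                               # ...and back down: that's what gauges are for

if __name__ == '__main__':
    start_http_server(8000)                           # serves /metrics for Prometheus to scrape
    while True:
        handle_request()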
Some additional instrumentation best practices from Prometheus maintainers (Instrumentation | Prometheus):
Use base units (seconds, bytes) for metrics to keep consistency (Metric and label naming | Prometheus). For example, expose durations in seconds (not milliseconds) and sizes in bytes (not MB), with appropriate metric names (e.g., a counter task_duration_seconds_total summing seconds). This avoids confusion when comparing metrics.
Do not mix different semantics in one metric. For instance, don't use one metric with a label "state" that can be "free" or "capacity" and put values of different meaning in it - separate metrics are better (one for total capacity, one for current usage). A rule of thumb: the sum or average of all label values of a metric should make sense (Metric and label naming | Prometheus). If summing over a label dimension doesn't yield a meaningful value, you probably should have split metrics. For example, having a metric queue_length{queue="X"} is fine (sum over all queues = total items in all queues). But if you had queue_metric{queue="X", type="capacity|current"}, summing capacity and current length together makes no sense - better to have two metrics: queue_capacity and queue_length.
Avoid high-cardinality labels: As noted earlier, each unique label combination becomes a time-series in the TSDB. Metrics with unbounded or high-cardinality labels (like a label for user ID, request path, or session ID) can blow up the storage and slow queries (Metric and label naming | Prometheus). Instead, consider alternatives: can you aggregate or categorize instead of labeling with an ID? Perhaps export only counts of users in certain buckets, or have a separate logging system for per-user details. A notorious bad pattern would be labeling metrics with an error message or file name - there could be as many distinct values as events. If you absolutely need to expose such high-cardinality data, sometimes a log or trace is more appropriate than a metric.
Don't enforce complex relationships with labels - e.g., having two labels whose combination yields a huge cardinality (a Cartesian product). If you have a metric request_count{user, endpoint}, and there are 1000 users and 1000 endpoints, that could be up to 1,000,000 series if every user hits every endpoint. If that's too many, consider dropping one label or encoding common groupings differently. In summary, design metrics to be expressive but also bounded in dimensionality.
By adhering to these practices, you ensure that the data Prometheus collects is both useful and efficient to process. Counters for events, gauges for states, histograms for distributions (especially for request latency or sizes which are critical for SLOs), and mindful labeling will lead to an instrumentation that can scale to millions of series without becoming a headache.
Exporters: Node Exporter, cAdvisor, and Kube-State-Metrics
Prometheus's rich ecosystem of exporters makes it easy to monitor all parts of your stack, including third-party systems and underlying infrastructure, without modifying them. Some key exporters and integrations relevant to large-scale systems and HPC:
Node Exporter: The Prometheus Node Exporter is a lightweight agent that exposes a wide variety of Linux host metrics - CPU usage, memory, disk I/O, filesystem space, network stats, and dozens of kernel metrics (like entropy, context switches, etc.) (Monitoring Linux host metrics with the Node Exporter | Prometheus). You run node_exporter on each server; it listens on a port (default 9100) and provides metrics at /metrics. This exporter is fundamental for cluster monitoring - it provides the raw machine-level data. For example, metrics like node_cpu_seconds_total (CPU time broken down by mode), node_memory_MemAvailable_bytes, node_filesystem_free_bytes, etc., allow building dashboards for resource utilization and diagnosing hotspots at the hardware/OS level. In the HPC context, the node exporter can be augmented with cluster-specific metrics (like InfiniBand network stats or Lustre filesystem stats) via textfile collectors or specialized exporters, but it is the baseline for node health. Windows systems have an analogous Windows exporter for OS stats (Monitoring Linux host metrics with the Node Exporter | Prometheus).
cAdvisor (Container Advisor): cAdvisor is an open-source tool (integrated into Kubernetes nodes or standalone) that collects metrics about containerized applications. It monitors resource usage of containers - CPU, memory, file system, network - and provides per-container metrics. cAdvisor is often used to feed Prometheus in environments like Docker or Kubernetes: each node's kubelet has cAdvisor built in, exposing metrics on /metrics/cadvisor. In standalone form, you can run cAdvisor on a host and it will autodiscover Docker containers and expose their metrics. It is described as "cAdvisor (short for container Advisor) analyzes and exposes resource usage and performance data from running containers" (Monitoring Amazon EKS on AWS Fargate using Prometheus and ...). For example, cAdvisor will expose metrics like container_cpu_usage_seconds_total{container="X"} for each container X, plus memory, etc. This is crucial for microservice or cloud-native environments where you need to monitor inside the machine boundary at the container level. In HPC, if jobs are run inside containers (using Singularity or Docker), cAdvisor could theoretically be used, but more commonly HPC jobs are not individually containerized by Docker. However, HPC clusters might still run some system services in containers or even run a Kubernetes sidecar cluster (more on hybrid setups later), in which case cAdvisor's data is important.
kube-state-metrics: This is an exporter that focuses on Kubernetes object states rather than resource usage. It connects to the Kubernetes API and generates metrics about the desired and current state of things like Deployments, Pods, Nodes, etc. For example, kube-state-metrics will produce metrics such as kube_pod_status_phase{namespace, pod, phase} indicating how many pods are in Running/Pending, or kube_deployment_spec_replicas vs. kube_deployment_status_replicas_available. Essentially, it turns the state of K8s resources into Prometheus metrics (kube-state-metrics: Your Guide to Kubernetes Observability | Last9). This is extremely useful for alerting on conditions like "Deployment X has not met its desired replica count" or "there are 5 CrashLoopBackOff pods". In a hybrid cluster with Kubernetes, kube-state-metrics complements cAdvisor. While cAdvisor gives you resource usage of pods, kube-state-metrics gives you the orchestrator's perspective (are pods healthy, are cronjobs succeeding, etc.). Kube-state-metrics is not directly related to SLURM, but if you manage an HPC cluster alongside cloud orchestration, you'll likely run both node exporter (for raw nodes) and kube-state-metrics (for K8s) to cover different layers. Note: kube-state-metrics does not duplicate what node exporter or cAdvisor do - it focuses on high-level states from the API server and deliberately doesn't report resource metrics like CPU (so there's no conflict with cAdvisor). It's a great example of an exporter that isn't tied to a single process or OS, but to an API: in this case, it listens to K8s and exports metrics reflecting that state (kubernetes/kube-state-metrics: Add-on agent to generate ... - GitHub).
Other Exporters: There are hundreds: for databases (e.g., a PostgreSQL exporter that runs SQL queries to gather stats like transaction rates and cache hits), for messaging systems (RabbitMQ, Kafka exporters), storage systems (Ceph, Elasticsearch), etc. In HPC environments, common ones include the Torque/PBS exporter if using those schedulers, or an NVIDIA GPU exporter (though Prometheus can also get GPU stats via the DCGM exporter or nvidia-smi text output). There are also cluster-specific ones: for example, InfiniBand network exporters to capture fabric performance counters, or power monitoring exporters. Since our focus is SLURM, one important exporter is the SLURM exporter (discussed in the Integration section), which exports SLURM job queue and node state metrics for Prometheus.
The guiding principle of exporters is that they bridge external systems to Prometheus by translating whatever monitoring interface those systems have (be it CLI commands, APIs, or /proc files) into the Prometheus metrics exposition format. This avoids having to modify those systems. For instance, the node exporter reads many files from /proc and /sys in Linux to get its metrics, cAdvisor taps into the container runtime, and kube-state-metrics uses the K8s API. As a user of Prometheus, you just point Prometheus at these exporters' endpoints. This provides a uniform view in Prometheus where everything is a time series with labels, even though under the hood the data came from very different sources.
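Writing a small custom exporter follows the same bridge pattern: read whatever interface the system offers, publish it as metrics. A hedged sketch that exposes one hypothetical gauge over HTTP with prometheus_client (the metric name, port, and the /tmp stand-in for a real scratch mount are assumptions):

import shutil, time
from prometheus_client import Gauge, start_http_server

# Hypothetical metric: free space on a shared scratch filesystem, in bytes (base units).
SCRATCH_FREE = Gauge('scratch_filesystem_free_bytes',
                     'Free bytes on the shared scratch filesystem')

def collect_once():
    # A real exporter would read /proc or /sys, call an API, or parse a CLI tool here.
    SCRATCH_FREE.set(shutil.disk_usage('/tmp').free)

if __name__ == '__main__':
    start_http_server(9200)            # Prometheus scrapes http://<host>:9200/metrics
    while True:
        collect_once()
        time.sleep(30)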
SLIs and SRE Principles (Choosing the Right Metrics)
In Google's Site Reliability Engineering (SRE) practices, a Service Level Indicator (SLI) is a carefully defined metric that reflects the quality of service - essentially, what your users care about. Typical SLIs include request latency, error rate, throughput, availability, etc. (Google SRE - Defining slo: service level objective meaning). For example, an SLI might be "the fraction of requests that succeed" or "99th-percentile response time". These tie into SLOs (objectives) and SLAs (agreements), which set targets for these indicators (e.g., 99.9% of requests under 200ms). When instrumenting systems, it's important to capture metrics that can serve as SLIs for your services. Prometheus is often used to gather those metrics.
SRE's Golden Signals: Google's SRE book suggests monitoring four key signals for any service: Latency, Traffic, Errors, Saturation. Latency and errors are direct indicators of user experience (how fast? how often failures?). Traffic (e.g., QPS or requests per second) gives context of load. Saturation refers to how utilized the system is (CPU, memory, etc.), indicating how close to its limits the service is. Prometheus instrumentation should ideally cover these: e.g., an HTTP service would have a latency histogram (for request duration), a counter for requests (traffic), possibly partitioned by outcome (to compute error rate), and gauges or resource metrics to measure saturation (CPU, memory from node exporter, etc.). Many off-the-shelf instrumentation libraries (such as the Go client's HTTP handler instrumentation, or Micrometer-based Spring Boot metrics in Java) expose such metrics.
In HPC terms, SLIs could be things like job success rate, scheduling latency (time from job submission to start), or cluster utilization. For instance, if you treat the "cluster" as a service to users, you might have an SLI "job start time within X minutes" or "percentage of nodes available". However, HPC users often care about throughput (jobs per day) and fairness, which are harder to capture as single SLIs. Still, monitoring something like slurm_job_pending_time_seconds (if such a metric is exported per job) could be valuable to see if queue times are rising, indicating potential issues.
Metric design for SLIs: SRE encourages that the metrics you choose should closely measure user-facing outcomes (Google SRE - Defining slo: service level objective meaning). For example, if you run a web portal, an SLI could be the successful page load rate. For an HPC batch system, an SLI for "job scheduling" might be the fraction of jobs started within their expected wait time, or the failure rate of jobs due to system errors (not user errors). It's important to differentiate system reliability from user-caused failures. If a job fails because of a code bug (non-zero exit), that might not count against the service reliability from the SRE perspective (the HPC system was functioning). But if jobs fail due to node crashes or scheduler errors, that is an issue. Thus, one might define an SLI like "% of node-hours available vs. total" as an availability measure of the HPC infrastructure.
Using Prometheus, you can implement these SLIs. For example, for an error rate SLI, define an alert on rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.01 to detect if more than 1% of requests are 5xx (server errors). Or for HPC, monitor slurm_job_failed_count metrics (if available) to alert if system-wide job failures spike. Google's SRE also emphasizes aggregation and roll-up: raw metrics need to be aggregated to meaningful levels (e.g., overall success rate vs. per-instance). Prometheus's ability to aggregate by labels helps here - you might compute SLIs via recording rules, e.g., a rule that calculates "cluster_job_success_ratio" by dividing successful jobs by total jobs over some window, excluding user error codes.
Another principle is alert on symptoms, not causes. That means setting alerts on SLI breaches (e.g., "error rate > 5%" or "95th-percentile latency > 300ms for 15m"), rather than on intermediate metrics like CPU or memory unless they directly impact the service. In an HPC context, a symptom alert might be "job queue wait time above 1 hour for highest-priority jobs" as a sign that something is wrong (maybe the cluster is full or the scheduler is stuck). The cause could be varied (a bad node, etc.), but the symptom catches the user-facing impact.
Service Level Objectives (SLOs) tie into this: if your SLO is "99% of jobs complete successfully", you'll watch a metric of successful job completion and ensure it stays above 99%. Prometheus can be used to track SLO compliance over time (e.g., using recording rules to compute rolling success rates). A common pattern is to calculate an "error budget burn rate" - how fast you are consuming the allowable errors - using PromQL, and alert if the burn rate is too high.
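As a sketch of that pattern, a burn-rate check can be scripted against the HTTP API; the expression, the 99.9% SLO target, and the 14.4x fast-burn threshold below follow commonly cited multi-window SRE guidance but are assumptions here, not part of the original text.

import requests

# Fraction of a 99.9% availability error budget being consumed, measured over 1 hour.
BURN_RATE_EXPR = (
    '(sum(rate(http_requests_total{status=~"5.."}[1h]))'
    ' / sum(rate(http_requests_total[1h]))) / (1 - 0.999)'
)
resp = requests.get('http://localhost:9090/api/v1/query',
                    params={'query': BURN_RATE_EXPR}, timeout=10)
result = resp.json()['data']['result']
burn = float(result[0]['value'][1]) if result else 0.0
if burn > 14.4:     # assumed fast-burn threshold; tune to your own SLO policy
    print(f'Error budget burning {burn:.1f}x faster than sustainable - investigate now')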
In summary, when designing monitoring for systems (including HPC clusters), identify your critical SLIs (latency, throughput, success rate, utilization, etc.) and ensure those are measurable via metrics. Use counters/histograms for latency and errors (like request durations, job runtime distributions, error counts), use ratios in PromQL to gauge success rates, and use gauges for saturation (like cluster utilization = used_nodes/total_nodes). By following the SRE approach, you focus on metrics that matter to reliability and user happiness, rather than drowning in irrelevant data. And by leveraging Prometheus's label and aggregation model, you can get both high-level views and detailed breakdowns as needed for troubleshooting.
High-Cardinality Pitfalls and Metric Design Anti-Patterns
As clusters grow and systems become more complex, it's easy to be tempted to label metrics with very granular dimensions. However, high-cardinality metrics (metrics with a huge number of unique label combinations) can severely impact Prometheus's performance and memory use (Metric and label naming | Prometheus). Each unique label set (time-series) consumes memory for tracking and CPU for processing queries. Let's discuss a few pitfalls and anti-patterns:
Per-User or Per-Job Metrics: In HPC, one might try to export metrics for each job or each user - for example, a metric labeled with job_id for job runtime or user_id for usage. This quickly becomes infeasible: thousands of jobs run daily, and user IDs can be in the hundreds or thousands. If you had a metric job_cpu_seconds{job_id=12345}, you'd end up with as many time-series as jobs - likely overwhelming Prometheus. Instead, metrics should be aggregated or sampled. SLURM's Prometheus exporter, for instance, does not export per-job CPU directly. It exports counts of jobs in certain states, or perhaps the sum of CPU time used by all running jobs, etc., which are bounded. It does export per-user and per-account job counts (Running/Pending) (GitHub - vpenso/prometheus-slurm-exporter: Prometheus exporter for performance metrics from Slurm.), but those cardinalities are equal to the number of users/accounts (which is usually manageable - perhaps a few hundred). In general, avoid labeling by an ID that can have an unbounded range (job IDs increment forever). If you need to analyze specific job durations or similar, that might be better done by querying the job scheduler's accounting database or logs, not by keeping a time-series for each job in Prometheus.
Dynamic Labels from Data: Another anti-pattern is including things like error message text, file paths, query strings, or other highly variable data as a label value. For example, imagine an exporter for a web server that labels a metric with the URL path of each request - there could be millions of distinct URLs (including IDs or queries), which would kill the TSDB cardinality. Or including a timestamp or UUID as a label (which obviously makes every sample unique). A real-world example: labeling a metric with the Kubernetes pod name - since pod names often have unique suffixes for each deployment instance, they churn a lot. It's usually better to label by something like app or service, which is stable, rather than by each instance name.
Combinatorial Explosion: Sometimes each label alone seems fine, but combined labels blow up. For example, suppose you have microservice metrics labeled by endpoint and status_code. That's reasonable (say 20 endpoints * 5 status classes = 100 series). But if you also add user_tier (free vs. paid, etc.), that multiplies (2*100 = 200). Then add region, and it multiplies further (say 3 regions: 600). It can creep up. Always consider how labels multiply out. Prometheus guidance suggests keeping the cardinality of any metric under a few hundred at most (How dangerous are high-cardinality labels in Prometheus?). If you see metrics with 10k+ series, question whether all those dimensions are needed simultaneously.
Staleness and Sparse Metrics: Another design problem is metrics that appear and disappear dynamically (sparse metrics) - for example, a metric that is only present when a certain event happens. If it's absent normally, queries can be tricky (Prometheus treats missing series as "stale"). A tip from best practices: if you anticipate a series might not always exist, you can proactively export it as 0 when inactive, to avoid sparseness (Instrumentation | Prometheus). This ensures the series is always present (0 when no events), making alert logic simpler (you don't have to handle "no data" vs. "zero"). The client libraries often do this for you for counters - they'll report 0 even if nothing has incremented yet (Instrumentation | Prometheus).
Overly Granular Histograms: Histograms can also cause cardinality issues if misused. For instance, having a histogram with 50 buckets is fine. But if you label that histogram by something like route with 100 possible values, that's 50*100 = 5000 series just for one metric family. Multiply by multiple histograms and you can see the cost. Make sure that if you use labels on histograms, they're truly needed and of low variety. Or consider using one histogram per major category, not per very detailed label.
Mitigation and Patterns: If you truly need high-cardinality analysis, one approach is to offload it to a logging or tracing system. Prometheus is optimized for aggregated data. If you needed per-job info, you might log each job's runtime to Elasticsearch or a database and use Kibana or Spark to analyze it. Prometheus could still alert on, say, "job failure count > 0 in 5m", but not keep a time-series per job. Another approach is to use exemplars (a newer Prometheus feature), where you attach a trace or log ID to a few samples for context, rather than as a persistent label on all samples.
Also, if certain high-cardinality metrics are valuable but heavy, consider shorter retention or relabeling to drop or aggregate labels at scrape time. For example, you might use Prometheus's drop/keep relabel configurations to ignore metrics that exceed a cardinality threshold or to strip certain labels you don't need.
In HPC monitoring, one must be especially careful: there are thousands of nodes, hundreds of users, and thousands of jobs - metrics involving all three could theoretically generate millions of series (e.g., cpu_usage{node, user, job} is a non-starter!). So metrics exported by the SLURM exporter or node exporter don't do that. They give you node_cpu by node and CPU mode (like idle, system, user), which is a fixed small set, or slurm_jobs_running{partition}, etc. By sticking to aggregated counts (jobs per state, nodes per state, etc.), the SLURM exporter keeps cardinality manageable (GitHub - vpenso/prometheus-slurm-exporter: Prometheus exporter for performance metrics from Slurm.). If you extend monitoring, say with a script that exports GPU process info on each node, be cautious not to include the process ID as a label - better to aggregate, e.g., "number_of_gpu_jobs_running_on_node".
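One low-risk way to add such aggregated metrics yourself is the node exporter's textfile collector: write bounded, pre-aggregated numbers to a .prom file and let node_exporter expose them. A sketch (directory path and metric name are examples) that exports per-partition job counts rather than one series per job:

import subprocess
from collections import Counter
from prometheus_client import CollectorRegistry, Gauge, write_to_textfile

registry = CollectorRegistry()
jobs = Gauge('slurm_queue_jobs', 'SLURM jobs by partition and state',
             ['partition', 'state'], registry=registry)

# %P is the partition, %T the job state; -h drops the header line.
out = subprocess.run(['squeue', '-h', '-o', '%P %T'],
                     check=True, capture_output=True, text=True).stdout
counts = Counter(tuple(line.split()) for line in out.splitlines() if line.strip())
for (partition, state), n in counts.items():
    jobs.labels(partition=partition, state=state).set(n)   # cardinality = partitions x states

# Picked up by node_exporter started with --collector.textfile.directory pointing at this path.
write_to_textfile('/var/lib/node_exporter/textfile/slurm_queue.prom', registry)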
In summary: High-cardinality metrics can stealthily degrade your monitoring system. Always design metrics with questions: Do I need this level of detail? Can I aggregate or bucket this? When in doubt, lean toward fewer, more general metrics and use labels that correspond to stable, bounded sets (service names, roles, categories) rather than unbounded sets (user IDs, timestamps). By doing so, you keep Prometheus healthy and queries fast, which in turn ensures you can reliably alert on and understand your systemâs behavior (Metric and label naming | Prometheus).
Advanced PromQL: Rates, Subqueries, and Metric Joins
PromQL's expressiveness allows complex queries to support deep insights and alerting logic. Here we highlight some advanced uses: calculating rates, using subqueries for rolling windows, and joining metrics with vector matching and label manipulation.
Rate() for Counters: As mentioned, rate(counter[5m]) is the go-to for turning a monotonically increasing counter into a per-second rate over the last 5 minutes. For example, rate(http_requests_total[1m]) gives the requests per second over the past minute. Use irate() (instant rate) if you need the very latest two-sample rate (more spiky). Why 5m? Using a window smooths out short bursts and noise. Many official alerts use a 5m or 10m rate. E.g., an alert "High Request Rate" might check whether rate(requests_total[5m]) exceeds some threshold. Rate is fundamental for counters because raw counters are generally not meaningful without time normalization (Instrumentation | Prometheus).
Increase() for Counters: increase(counter[1h]) tells you how much the counter increased over the past hour (essentially the integral of the rate). This is great for measuring, say, "X events happened in the last hour". For example, increase(jobs_completed_total[1d]) could give how many jobs completed in the last 24 hours.
Aggregations and Grouping: You can aggregate over any label using sum, avg, min, max, etc., and use the by() or without() clauses. For example, sum(rate(node_cpu_seconds_total{mode!="idle"}[5m])) by (instance) gives CPU usage per instance. If you drop by (instance), it gives total CPU usage across all instances (a cluster-wide number). One can also do avg by (job) or aggregate over other dimensions as needed. Grouping is key to high-level dashboards (like overall cluster CPU utilization) and SLO calculations (like percent errors = sum errors / sum total * 100).
Vector Matching (Joins): PromQL allows binary operations on metrics, which effectively join two vectors of data on their labels. By default, if you do metricA * metricB, Prometheus will pair up time-series with exactly the same set of labels (excluding the metric name) (Left joins in PromQL - Robust Perception | Prometheus Monitoring Experts). This is akin to an inner join on all labels. Often you need to join on a subset of labels; that is where the on() or ignoring() clause and the group_left/group_right modifiers come in. For example, suppose you have a metric node_cpu_seconds_total{mode, instance} and another metric node_total_memory_bytes{instance} (just as a hypothetical). If you wanted to compute CPU seconds per byte of memory per instance, you need to match by instance, but one metric has a mode label and the other does not. You could do: sum(rate(node_cpu_seconds_total{mode!="idle"}[5m])) by (instance) / node_total_memory_bytes. PromQL will then ignore mode (because we aggregated it away) and match by instance. If labels do not match exactly, you use on(label) to specify which labels to match on. Conversely, ignoring(label) matches on all labels except the ones listed (like ignoring mode). The group_left or group_right modifiers handle one-to-many or many-to-one joins. For instance, if one metric is per-node and another is per-cluster, you might use group_left to attach a cluster-level label to each node metric. A concrete example from documentation: a * on(foo,bar) group_left(baz) b means join metrics a and b where their foo and bar labels match; b has an extra label baz that a does not, and group_left(baz) means include baz from the right-hand side in the result (Left joins in PromQL - Robust Perception | Prometheus Monitoring Experts). It also implies that for each combination of foo,bar labels, b may have one series while a may have many (a many-to-one join, taking the left side's multiple series and attaching the single right-hand value with its baz label to them) (Left joins in PromQL - Robust Perception | Prometheus Monitoring Experts). This is how you can, say, join a constant series of total cluster capacity to many per-node series to calculate a ratio.
label_replace() and label_join(): These are functions to manipulate label values. label_replace(vector, "new_label", "replacement", "src_label", "regex") can copy or transform an existing label. For example, if you have instance="host:9100" and you want just the hostname without the port as a label, you could do: label_replace(up, "hostname", "$1", "instance", "([^:]+):.*"). This creates a new label hostname by regex-extracting everything before the colon from instance (PromLabs | PromQL Cheat Sheet). This is useful if your label naming is not ideal or you want to group differently. label_join() can merge multiple label values into one, e.g., combine region and az labels into a single region_az label. These functions operate per time-series in a vector and return a new vector with modified labels.
Subqueries: Introduced in Prometheus 2.7, subqueries evaluate an inner expression over a range at a chosen resolution, producing a range vector you can aggregate further. The syntax is metric_or_expression[range:resolution]. For example, rate(http_requests_total[5m])[1h:30s] would produce a time-series of the 5m rate evaluated every 30s over the past hour (PromLabs | PromQL Cheat Sheet) (PromLabs | PromQL Cheat Sheet). Without subqueries, you could only directly graph or alert on the current value of a rate, or use recording rules to store it. Subqueries make ad-hoc rolling-window computations easier in the query itself. One common use is to compute a long-window quantile from short-window data. For instance, "what was the max 5m error rate over each 1h interval in the last day": subqueries can help with that by first computing the 5m error rate at some resolution, then applying max_over_time(...[1h]). They are powerful, but be cautious with performance (subqueries pull potentially a lot of data into memory).
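To make the subquery idea concrete, here is a minimal sketch of the "peak short-window error ratio" pattern described above. The metric name http_requests_total and the status label are illustrative placeholders and would have to match your own instrumentation:

# Highest 5-minute error ratio observed over the last hour,
# re-evaluating the inner expression every minute (subquery [1h:1m]).
max_over_time(
  (
      sum(rate(http_requests_total{status=~"5.."}[5m]))
    /
      sum(rate(http_requests_total[5m]))
  )[1h:1m]
)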
PromQL's advanced features shine in creating derived metrics and sophisticated alerts. For example, consider an alert: "High CPU load relative to cores available". You have node_load1 (the 1-minute load average) and node_cpu_seconds_total (for the CPU count, you can count the per-core series of a single mode, e.g. mode="idle"). You might write:

node_load1{job="node"}
  / on(instance)
count by (instance) (node_cpu_seconds_total{job="node", mode="idle"})
  > 0.8

to detect when the load exceeds 80% of the CPU count on a node. Here you are dividing two metrics after matching them by instance; the explicit on(instance) is needed because the count by (instance) result no longer carries the other labels (like job) that node_load1 has. This kind of algebra makes Prometheus more of a "metrics processing language" than just a query filter.
Another advanced alert could be based on the absent() function: for example, if a metric from a service has not been seen for 5 minutes, maybe the service is down. absent(up{job="myservice"} == 1) can fire when there are no targets up for that service.
For joining metrics that don't share labels, you can sometimes label one side via a recording rule or static config to enable a join. For instance, if you want to attach datacenter info to each node metric, and you have an external file listing node-to-datacenter mappings, you could create a gauge metric like node_meta{instance="foo", datacenter="east"} = 1 for each node (perhaps pushed via the textfile collector to node exporter). Then in PromQL do node_cpu_seconds_total * on(instance) group_left(datacenter) node_meta to get a version of node_cpu with the datacenter label.
In HPC monitoring, these techniques are useful. For example, you can join SLURM metrics with node exporter metrics: if the SLURM exporter gives slurm_node_state{state="allocated", nodename="foo"} as 1/0, and node exporter has node_energy_joules_total{instance="foo:9100"}, you might align them by some label (though the nodename vs instance string differences require label_replace), as sketched below. One could then calculate "power usage of allocated nodes vs idle nodes" by joining the two data sets.
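As a hedged sketch of that alignment (using the hypothetical metric names above; slurm_node_state is assumed to be a 1/0 gauge and node_energy_joules_total a counter in joules, so its rate is watts), label_replace can derive a nodename label from the node exporter's instance label before the join:

# Approximate power draw (watts) of currently allocated nodes.
# Metric names are illustrative; adjust to what your exporters actually expose.
sum(
    label_replace(rate(node_energy_joules_total[5m]),
                  "nodename", "$1", "instance", "([^:]+):.*")
  * on(nodename)
    slurm_node_state{state="allocated"}
)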
Overall, advanced PromQL allows you to derive insights that raw metrics alone don't show. It's often where the monitoring magic happens: defining alerts that combine multiple conditions (e.g., high error rate and high latency -> very bad), or creating summary metrics for dashboards (like cluster efficiency = running cores / total cores, using metrics from the Slurm exporter divided by a constant total). Mastering these techniques enables writing precise alerts that reduce noise; for instance, alerting on a ratio or trend rather than a single metric threshold. It also helps in capacity planning analysis, anomaly detection, and generating SLO reports from raw data.
4. Resource Scheduling in Depth
SLURM Scheduling Algorithms: FCFS, Fair-Share, Preemption, Backfill, Topology-Aware
SLURM's scheduling can be tailored, but let's break down common strategies:
First-Come, First-Served (FCFS): By default, SLURM will try to schedule jobs in submission order (oldest first) if resources are available. However, pure FCFS can lead to suboptimal utilization (a big job at the top might block many small ones behind it). That's why SLURM rarely runs strictly FCFS; it introduces priority factors and backfill. But conceptually, jobs are sorted by a priority value (which could be just their queue time, implementing FCFS) (Scheduling Basics - HPC Wiki). If a top job can't run (waiting for enough nodes), the scheduler will look down the queue.
Priority & Fair-Share: SLURM's multifactor priority plugin computes a priority for each pending job based on factors like job age (how long it's been waiting), job size (resources requested), QOS level, and fair-share usage (Slurm Workload Manager - Overview) (Slurm Workload Manager - Classic Fairshare Algorithm). Fair-share means each user or account has a target share of the cluster. SLURM tracks actual historical usage (via SlurmDBD) and adjusts job priority: users who have used more than their fair share get lower priority on new jobs, and users under their share get a boost (Slurm Workload Manager - Classic Fairshare Algorithm). This ensures equitable usage over time. The fair-share algorithm often implemented in SLURM is a weighted "fair tree" that ranks users by an assigned fair-share value (Slurm Fairshare Refresher | High-Performance Computing - NREL). Administrators can configure how heavily fair-share weighs relative to other factors (for example, you might still favor small jobs or urgent QOS jobs over fair-share at times). The priority formula is typically something like:
Priority = f(FairShare, Age, JobSize, QOS, Partition)
where each component is normalized to some range. A nice value can also be set by users to lower their own job priority (much like nice in Linux) ([PDF] Slurm Priority, Fairshare and Fair Tree - SchedMD).
Preemption: In some setups, certain jobs can preempt others. SLURM supports preemption via configuration of QOS or partitions. For example, a high-priority queue could be set to preempt lower QOS jobs (either by suspending them or by killing and requeuing them) if needed to free resources. This is used in scenarios like urgent computing or to ensure short critical jobs aren't stuck behind long low-priority jobs. Preemption can work by suspension (the job's processes are SIGSTOP'd, freeing CPU but holding memory) or by checkpoint/requeue if the job is run with checkpointing.
Backfill Scheduling: This was discussed earlier: SLURM's scheduler will scan the queue to find jobs that can run in the "holes" of resources while waiting for a higher priority job at the top to start (Scheduling Basics - HPC Wiki). The backfill scheduler respects job reservations for higher priority jobs. For example, suppose JobA (high priority) needs 100 nodes for 2 hours, but only 50 are free now. The scheduler might reserve 100 nodes that will free up in 1 hour for JobA (meaning JobA will start in 1 hour when those become available). Meanwhile, if there are smaller jobs down the queue that can fit into the currently free 50 nodes and finish within 1 hour (so they'll be done by the time JobA needs them), those can be started now via backfill (Scheduling Basics - HPC Wiki). Backfill thus requires accurate job time limits (users provide --time for each job); if a backfilled job exceeds its time and interferes with the reserved job, SLURM will kill it when the high-priority job starts. HPC centers often encourage reasonably accurate walltime estimates to maximize backfill efficiency. Without backfill, those 50 free nodes would sit idle for an hour waiting for the big job. With backfill, they get used.
Topology-Aware Scheduling: In clusters with a non-uniform network topology (like a fat-tree network or Dragonfly where some nodes are "closer" to each other), SLURM can employ topology-aware scheduling (Slurm Workload Manager - Overview). This means the scheduler tries to allocate nodes for a job that are close together network-wise, to minimize communication latency for MPI jobs. SLURM can be configured with a topology plugin that knows about the switch layout. For example, if a job needs 16 nodes, topology-aware scheduling might allocate 16 nodes within the same rack or network switch group if possible, rather than 8 on one switch and 8 on a distant switch, to improve the job's performance. This is more of a concern on very large clusters and is usually a secondary optimization after core scheduling decisions. It can reduce network contention and is a form of packing strategy. Another angle is NUMA topology on a single node, but that is usually handled by job configurations (like --ntasks-per-socket options) rather than the scheduler's global view.
Partitioning: Administrators can divide the cluster into partitions (like queues) which can have different scheduling policies or usage limits. For example, a "debug" partition might allow only short jobs (time limit 30 min) and have priority over the general partition for quick turnaround. Or a "gpu" partition that includes only GPU-equipped nodes. Users direct their jobs to a partition (a default applies if not specified). Each partition can have its own scheduling parameters: some might not allow backfill or have different nice offsets, etc. Partitions can also overlap (a node can be in multiple partitions), giving flexibility (e.g., a node might be usable for either "general" or "gpu" jobs). When a job is submitted, it's associated with a partition, and SLURM's scheduler essentially maintains separate queues per partition (though if partitions share nodes, jobs from either could compete for those nodes, resolved by priority). This is a way to implement policy: e.g., a partition for a specific project ensures only that project's jobs run on its nodes, or a preemptible partition uses cheaper nodes and lower QoS.
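To connect these policies to configuration, the sketch below shows roughly where the knobs discussed above live in slurm.conf. It is illustrative only: the parameter names are standard SLURM options, but the values are placeholders a site would tune (or omit) to match its own policy:

# Illustrative slurm.conf fragment (placeholder values, not recommendations)
SchedulerType=sched/backfill           # enable the backfill scheduler
PriorityType=priority/multifactor      # multifactor job priority
PriorityWeightAge=1000                 # weight of time spent waiting
PriorityWeightFairshare=10000          # weight of the fair-share factor
PriorityWeightJobSize=500              # weight of requested job size
PriorityWeightQOS=2000                 # weight of the QOS factor
PriorityDecayHalfLife=7-0              # historical usage decays with a 7-day half-life
PreemptType=preempt/qos                # allow QOS-based preemption
PreemptMode=SUSPEND,GANG               # preempt by suspending lower-priority jobs
TopologyPlugin=topology/tree           # topology-aware node selection
PartitionName=debug Nodes=node[001-004] MaxTime=00:30:00 PriorityTier=10 State=UP
PartitionName=general Nodes=node[001-128] MaxTime=2-00:00:00 Default=YES State=UP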
In operation, SLURM's scheduler (slurmctld) runs periodically (by default every few seconds) to update priorities and attempt to start jobs. It uses the configured algorithm: sort jobs by priority, try to allocate resources, apply backfill to fill gaps, etc. There is also a scheduling cycle, and you can get statistics via sdiag (scheduler diagnostics), which shows things like how long each cycle takes and how many jobs were considered/backfilled (GitHub - vpenso/prometheus-slurm-exporter: Prometheus exporter for performance metrics from Slurm.) (GitHub - vpenso/prometheus-slurm-exporter: Prometheus exporter for performance metrics from Slurm.). Tuning may be required for very large systems, e.g., limiting how deep in the queue the backfill algorithm looks, to avoid spending too long computing.
Summary: SLURM's scheduling takes into account when the job was submitted (age factor), who submitted it (fair-share usage), what resources it needs (favoring smaller jobs for backfill or via a size factor), what QoS/partition it's in, and any dependencies or reservations. Preemption and backfill improve utilization and responsiveness for high-priority work. Topology and partitions implement site-specific policies and optimizations. The end goal is to keep the cluster busy (no idle resources unless by policy) while meeting priority rules. Admins have a lot of control: they can define QOS (Quality of Service) levels that bundle priority boosts or preemption ability, and assign jobs to a QOS. For example, a "high" QOS could give +1000 priority but be limited to a group of users or to a number of jobs. All these knobs influence the multifactor priority plugin's outcome (Slurm Workload Manager - Overview).
From a monitoring perspective, understanding these algorithms is important: for example, if jobs are waiting a long time, is it because fair-share is throttling someone? Or is a partition at capacity? Metrics like "pending jobs per partition" or "age of oldest job" can reveal if a scheduler policy is causing a bottleneck. SLURM's exporter even gives some metrics like scheduling cycle time and backfilled jobs count (GitHub - vpenso/prometheus-slurm-exporter: Prometheus exporter for performance metrics from Slurm.), which can indicate whether backfill is actively working or the scheduler is overloaded.
Resource Containers: cgroups and Job Constraints (--mem, --cpus-per-task, --gres)
SLURM ensures that when multiple jobs share a node, they don't interfere, by using Linux cgroups (control groups) to constrain resources per job (Slurm Workload Manager - Control Group in Slurm). Cgroups group a job's processes and enforce limits on CPU, memory, and more:
CPU Enforcement: If a job requests, say, 4 CPUs, SLURM will allocate 4 CPU cores to it (possibly as 4 out of the node's 32 cores) and then create a cgroup limiting the job's processes to those CPU cores (cpuset) (Slurm Workload Manager - Control Group in Slurm). This ensures the job cannot use other cores. It can also enforce CPU shares or quota: if jobs are time-sharing a core, cgroups can weight them. But typically in HPC, jobs get exclusive core access.
Memory Enforcement: The --mem option in SLURM requests a certain amount of memory per node (or --mem-per-cpu for per-CPU memory). SLURM's cgroup plugin will set the job's cgroup memory limit to that amount (Slurm Workload Manager - cgroup.conf). If the job tries to use more, the Linux OOM killer will terminate processes in that cgroup (effectively killing the job, which SLURM will record as an out-of-memory failure). This containment prevents one job from crashing a node by using all memory and pushing the OS to OOM other jobs. HPC clusters often must configure the cgroup memory plugin for reliability (Slurm Workload Manager - Control Group in Slurm); otherwise a single rogue job can consume all RAM and starve others.
GPU and other Generic Resources (GRES): SLURM can manage generic resources like GPUs, MIC (Xeon Phi), or even licenses, via the GRES mechanism (Using GPUs with Slurm - Alliance Doc). If a user requests --gres=gpu:2, SLURM will allocate 2 GPUs on a node to that job. It uses cgroups or vendor-specific libraries (like Nvidia's controls) to enforce that only those GPUs are visible to the job's processes. For example, it will set CUDA_VISIBLE_DEVICES for the job, or use the devices cgroup to restrict which GPU device files are accessible (Slurm Workload Manager - Control Group in Slurm). GRES are configured in gres.conf on each node to tell SLURM how many of each resource the node has. Modern SLURM also has a --gpus option (an alias to GRES) and can schedule GPU resources more natively. The exporter metrics we saw track GPU allocation cluster-wide (GitHub - vpenso/prometheus-slurm-exporter: Prometheus exporter for performance metrics from Slurm.). GRES can also cover things like high-memory nodes or special hardware (just defined as a resource with a certain name).
Task Affinity (cpus-per-task, etc.): When launching parallel tasks, SLURM gives options like --cpus-per-task (how many CPU cores each task of a job can use), --ntasks (number of parallel tasks), and --nodes (number of nodes). These translate into how SLURM sets up cgroups on each node. For instance, --nodes=2 --ntasks=4 --cpus-per-task=2 means 4 tasks across 2 nodes (so perhaps 2 tasks per node), each needing 2 CPU cores. SLURM will allocate 2 cores per task (so 4 cores per node) and isolate each task's processes to their assigned cores via the cgroup cpuset. The --threads or --ntasks-per-core flags further refine placement (like whether hyperthreads count as separate or not). All of these ensure that MPI ranks or multi-threaded parts of jobs get the right resources.
Memory, Hugepages, etc.: SLURM's cgroup support extends to things like huge pages, network QoS, and devices. For memory, beyond just a total limit, it can enforce NUMA bindings if a job's allocation spans specific NUMA nodes. But typically, just ensuring the job cannot exceed its memory allocation is the main use. The snippet from cgroup.conf indicates one can even enforce a percentage overhead or swap usage (Slurm Workload Manager - cgroup.conf).
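As an illustration of how this enforcement is switched on, a cgroup.conf might contain something like the following minimal sketch (parameter availability varies by SLURM version, and slurm.conf must also select the cgroup plugins, e.g. TaskPlugin=task/cgroup and ProctrackType=proctrack/cgroup):

# Illustrative cgroup.conf fragment
ConstrainCores=yes       # pin job processes to their allocated cores (cpuset)
ConstrainRAMSpace=yes    # enforce the job's memory request as a cgroup limit
ConstrainSwapSpace=yes   # also cap swap usage
ConstrainDevices=yes     # hide GPUs/devices not allocated to the job
AllowedRAMSpace=100      # percent of the requested memory the job may use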
The net effect is that each job (or in SLURM terms, each "job step" if srun launches sub-tasks) runs in a sandbox of CPU cores, memory, and devices. This is crucial on shared nodes; without it, one job could hog CPU (say, run more threads than requested) or consume all memory (causing others to crash). With cgroups, if a job tries to use more CPU than allocated, the scheduler and OS simply won't allow its threads onto other cores (they'll be constrained to their cpuset, and if that is fully busy, additional threads just wait). If it tries to use more memory, it gets killed by the cgroup limit.
For HPC jobs that require the whole node, cgroups have less impact (the job already has the node exclusively). But in many clusters, nodes are shared among smaller jobs, so this isolation is critical.
Constraint Flags: A few important SLURM resource request flags and their meaning:
--nodes (or -N): number of nodes requested. Can be a range (e.g., --nodes=4-8 meaning "between 4 and 8 nodes; allocate as many as the scheduler can"), but this is rarely used unless the job can scale dynamically.
--ntasks (or -n): total number of tasks (processes) to run. If you use mpirun or SLURM's built-in launch with srun, SLURM will spawn this many processes (perhaps distributed across nodes). Often combined with nodes, e.g., --nodes=4 --ntasks=16 implies 4 tasks per node on 4 nodes.
--cpus-per-task: number of CPU cores a single task will use (e.g., if each task is multi-threaded with OpenMP, you might ask for 8 cores per task). This ensures SLURM allocates those cores together and sets OMP_NUM_THREADS=8, for example.
--mem or --mem-per-cpu: memory allocation. --mem is per-node memory; --mem-per-cpu multiplies by CPUs. They end up as a total memory cgroup limit. If not specified, a default or the whole node might be assumed, which can cause conflict if multiple jobs share the node without explicit memory splits. HPC centers often set a default per-core memory so that if a user doesn't specify, SLURM still divides memory proportionally.
--gres: as mentioned, generic resources like GPUs. The syntax --gres=gpu:K80:2 could request 2 Tesla K80 GPUs (if there are multiple GPU types). On the execution side, SLURM sets the GRES environment or uses the devices cgroup to give the job those GPU device files.
--constraint: not a resource quantity, but a filter for node attributes (like --constraint="[gpu&32GB]" to only use GPU nodes with 32GB memory). This uses node feature flags.
--time: not a hardware resource, but a limit: the maximum runtime. The scheduler uses this for scheduling (backfill) and will kill the job if the time is exceeded (enforced by slurmctld, not cgroups). Also used for fair-share decay calculations.
These flags translate to internal resource accounting that the scheduler uses to allocate resources and that slurmd uses to enforce via cgroups and other OS controls (Slurm Workload Manager - Control Group in Slurm) (Slurm Workload Manager - Control Group in Slurm).
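Putting the flags together, a typical batch script might look like the sketch below; the resource amounts, partition name, and program are illustrative:

#!/bin/bash
#SBATCH --job-name=mpi_example
#SBATCH --nodes=2                 # two nodes
#SBATCH --ntasks=4                # four MPI ranks in total (2 per node)
#SBATCH --cpus-per-task=2         # two cores per rank (e.g., for OpenMP threads)
#SBATCH --mem=16G                 # 16 GB of memory per node
#SBATCH --gres=gpu:2              # two GPUs per node
#SBATCH --time=01:00:00           # one-hour limit (also used for backfill decisions)
#SBATCH --partition=general

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
srun ./my_mpi_program             # srun launches the tasks inside the allocation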
On the monitoring side, one interesting aspect is capturing when cgroup limits are hit. For instance, if jobs frequently get killed for OOM (out-of-memory), one might want to alert. SLURM logs that, and some metrics could be derived (maybe the slurm exporter could count job failures by reason). Also, node exporter can show if processes are being throttled by cgroups (some metrics like container_cpu_cfs_throttled_seconds_total if cAdvisor were used). HPC centers sometimes run the Prometheus node exporter with cgroup metrics enabled to see per-cgroup resource usage, but that can be high-cardinality (since cgroup names include job IDs by default). If needed, one could aggregate by state: e.g., show total memory used by all jobs on a node vs total memory.
The bottom line is that SLURM leverages Linux kernel features to implement resource containers for jobs, very much akin to what container orchestrators (like Kubernetes) do for pods. In fact, running a job in SLURM is conceptually similar to running a Docker container with CPU/mem limits; both end up creating cgroups. HPC users typically don't notice cgroups except when they hit a limit (job killed for memory, or using more CPUs does nothing). This also provides accounting accuracy: SLURM can precisely measure a job's CPU time, maximum memory, etc., via cgroup controllers (for job accounting records).
Scalability: Job Arrays, Controller Failover, and RPC Architecture
Large HPC systems may have to manage many jobs (millions in backlog) and many nodes. SLURM has several features to handle scale:
Job Arrays: A job array is a single job submission that represents multiple similar jobs, indexed by a task ID. For example, a user can submit an array of 100 jobs with one sbatch command: sbatch --array=1-100 myscript.sh. This is far more efficient than 100 separate submissions. All array tasks share the same job script and resource request, but each gets an SLURM_ARRAY_TASK_ID to differentiate them (Slurm Workload Manager - Job Array Support) (Slurm Workload Manager - Job Array Support). Internally, SLURM treats them as 100 jobs with a common identifier and sub-ID, which can dramatically reduce scheduling overhead. The slurmctld can schedule them like normal jobs (and may dispatch many of them in one scheduler iteration if resources free up). Users often use job arrays for parameter sweeps or large sets of independent tasks (e.g., processing 1000 input files with the same code). Monitoring-wise, job arrays still appear as many jobs (e.g., squeue will list each task unless the output is compressed), but slurmctld stores them compactly. There are limits on array size (configurable, often up to 10k or 100k). There's also a %N limit where the user can say "only run N tasks of the array at once" (--array=1-100%10 to run at most 10 concurrently) (Slurm Workload Manager - Job Array Support), which helps avoid flooding the cluster. Arrays improve scalability by lowering submission overhead and giving the scheduler hints to treat a set of jobs uniformly. Prometheus metrics from the slurm exporter might include counts of jobs in arrays, or treat each task individually; most likely they simply count all tasks.
Controller Failover: As mentioned, you can have a backup slurmctld ready. The primary checkpoints its state to disk (and/or communicates with slurmdbd), and the backup monitors the primary's heartbeat. If the primary goes down, the backup takes over the management IP/hostname (or a virtual IP, etc.). This is more about reliability than scale, but it ensures that even a very large cluster has high availability of the scheduler. Typically only one controller is active at a time (active-passive). There's work on potential active-active operation in the future, but it's tricky to avoid conflicts. Failover is configured by BackupController in slurm.conf. For Prometheus, monitoring the health of slurmctld is vital; e.g., an exporter can have a metric for "slurmctld alive", or one could scrape slurmctld's port. If the primary dies and the backup takes over, an alert should probably fire, or at least the event should be recorded.
RPC and Communication Architecture: SLURM is designed for thousands of nodes, so the communication between slurmctld and slurmd is optimized. It uses RPC messages over sockets, often on a dedicated internal network. Slurmd daemons form a tree for certain communications (SLURM can use a hierarchical communication mode where a subset of slurmd daemons act as intermediates for messages to reduce load on the controller) (Slurm Workload Manager - Overview). For instance, in a huge cluster, slurmctld might not directly send a message to each of 10k nodes for a job launch; instead it could send to a communication hierarchy where, say, each rack has a representative. This is the tree communication mentioned in the documentation: "slurmd daemons provide fault-tolerant hierarchical communications" (Slurm Workload Manager - Overview), meaning that if a node is down, messages can route around it, and communication can be aggregated. This helps scalability by reducing message storms on the controller. Additionally, SLURM is configured with message timeouts and rates to handle load. If a user submits 100k jobs (not as an array), slurmctld would have to handle that many RPCs and can become a bottleneck; that's one reason job arrays are encouraged. Also, squeue for 100k jobs can be heavy; there are options to limit such queries or require specific filters.
Throughput and Tuning: Large HPC sites measure scheduler throughput (jobs per second it can start). SLURM is known to handle high throughput, but sometimes the scheduling cycle may be extended to gather multiple jobs to dispatch together for efficiency. The exporter metrics show things like "cycles per minute" and the slurmctld thread count (GitHub - vpenso/prometheus-slurm-exporter: Prometheus exporter for performance metrics from Slurm.). Slurmctld is multi-threaded; for example, it can schedule and respond to RPCs concurrently up to a thread count. If scheduling is slow (taking many seconds), metrics like Slurm Scheduler Threads in Grafana (which was 5 in the example dashboard) (SLURM Dashboard | Grafana Labs) (GitHub - vpenso/prometheus-slurm-exporter: Prometheus exporter for performance metrics from Slurm.) and Agent Queue Size (messages pending to the DB) help identify bottlenecks (in the example dashboard, scheduler threads = 5 and agent queue = 0 indicate no backlog, meaning DB writes are keeping up).
Job Submission Filters and SPANK Plugins: Slightly tangential, but relevant at scale: SLURM allows site-specific submission filters (Lua scripts or programs) to automatically adjust or reject jobs on submission (to enforce policies without manual admin intervention). Also, SPANK plugins can run in the job prolog/epilog on nodes to set up the environment (like a plugin that loads certain monitoring into a job). These help scaling by automating management for large numbers of jobs and nodes.
In essence, SLURM's design goals include being highly scalable (10k+ nodes, 100k+ jobs in queue) (Slurm Workload Manager - Overview). Features like job arrays and hierarchical communication address the scale of jobs and nodes respectively. Most HPC centers push these to their limits (some run millions of jobs per month and track scheduler latency). When monitoring such a system with Prometheus, one could create alerts like "scheduler cycle time > threshold" from sdiag metrics, to catch the scheduler struggling to keep up (which could happen if, say, someone floods the queue with 1e6 jobs).
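A sketch of such a rule is shown below. The metric name follows the illustrative naming style used elsewhere in this document and assumes the exporter reports the sdiag cycle time in microseconds; check your exporter's actual metric names and units before using anything like this:

groups:
  - name: slurm-scheduler
    rules:
      - alert: SlurmSchedulerSlow
        # hypothetical metric; 30000000 us = 30 s per main scheduling cycle
        expr: slurm_scheduler_last_cycle > 30000000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "slurmctld scheduling cycles have exceeded 30 seconds for 10 minutes"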
Prometheus itself should be scaled accordingly â scraping a slurm exporter that lists thousands of jobs or nodes is fine (since metrics are aggregated counts, cardinality is not huge). But if one tried to export every job as a metric (bad idea, as discussed), that would hit Prometheus limits.
Accounting and Usage Tracking: slurmdbd and Job States
One of SLURM's strengths for large clusters is robust accounting: every job's resource usage and state transitions can be recorded for reporting and enforcing policies:
Job State Lifecycle: A job typically goes through states: PENDING (waiting in queue) -> RUNNING (dispatched to nodes) -> COMPLETED (finished successfully) or CANCELLED/FAILED (ended with an error or killed) (GitHub - vpenso/prometheus-slurm-exporter: Prometheus exporter for performance metrics from Slurm.) (GitHub - vpenso/prometheus-slurm-exporter: Prometheus exporter for performance metrics from Slurm.). There are also intermediate states like COMPLETING (cleanup after a job ends) (GitHub - vpenso/prometheus-slurm-exporter: Prometheus exporter for performance metrics from Slurm.), SUSPENDED (paused by preemption) (GitHub - vpenso/prometheus-slurm-exporter: Prometheus exporter for performance metrics from Slurm.), TIMEOUT (killed due to time limit) (GitHub - vpenso/prometheus-slurm-exporter: Prometheus exporter for performance metrics from Slurm.), NODE_FAIL (failed due to a node crash), etc. SLURM assigns a reason if a job can't run (e.g., PartitionDown, Dependency, etc., visible in squeue -t pending -o %r). From a monitoring perspective, you might want to alert if many jobs go into certain bad states (NODE_FAIL, or FAILED unexpectedly).
slurmdbd Accounting: When enabled, each time a job ends, slurmctld sends a record to slurmdbd, which logs it into an SQL database. The record includes job ID, user, account, partition, nodes used, start time, end time, exit code, CPU seconds consumed, max RSS memory, etc. These records allow generating usage reports: for example, total core-hours used by each account per month, or average wait time for jobs in a partition. Commands like sacct (to list jobs after they finish) and sreport (pre-canned reports) query this database. From a metrics standpoint, you typically wouldn't import all this historical data into Prometheus (it's too detailed). Instead, one might compute summaries externally or feed some key stats to Prometheus (like "total cluster hours used per day" as a gauge). However, the slurm exporter provides some live usage metrics: it can query slurmctld/slurmdbd for things like the number of running jobs per user or account (GitHub - vpenso/prometheus-slurm-exporter: Prometheus exporter for performance metrics from Slurm.), and perhaps cluster utilization.
Usage Tracking for Fair-Share: slurmdbd keeps track of cumulative usage (often in terms of CPU-seconds or normalized core-hours) for each account/user, which feeds the fair-share priority calculation. Fair-share essentially compares a user's recent usage to their allocated share, updated periodically (Slurm Workload Manager - Classic Fairshare Algorithm). If using fair-share, you might want to monitor the fair-share values, but those are internal. However, the sshare command or the slurm exporter's "Share Information" might provide metrics like the effective fair-share score per account (GitHub - vpenso/prometheus-slurm-exporter: Prometheus exporter for performance metrics from Slurm.). In the Grafana dashboard snippet, there is a "Share Information" section implying it collects some stats from sshare (GitHub - vpenso/prometheus-slurm-exporter: Prometheus exporter for performance metrics from Slurm.).
Job Submit and End Rates: HPC centers might monitor how many jobs are being submitted/completed per hour, to detect anomalies (e.g., a user script going haywire and inadvertently submitting thousands of jobs). While slurmdbd can provide totals, Prometheus could scrape counts of jobs in each state over time to infer rates. The slurm exporter has a metric for "total jobs started thanks to backfill since last start" (GitHub - vpenso/prometheus-slurm-exporter: Prometheus exporter for performance metrics from Slurm.), which is a cumulative counter of backfilled jobs. It likely also has a counter of total jobs executed. These can be used to see throughput.
User and Account Quotas: SLURM can enforce quotas like max jobs per user, or max running jobs per account, via the accounting settings. Monitoring could be set up to alert when someone hits a quota, but more often, users learn via job rejections. Still, an admin might want to know if a user is constantly hitting their limits (maybe their share needs adjusting). Metrics such as "pending jobs per user" might show if someone has a large backlog (maybe they hit a limit, or are waiting on resources).
Node State Accounting: SLURM tracks node states: IDLE, ALLOCATED, DOWN, DRAINING, MAINT, etc. (GitHub - vpenso/prometheus-slurm-exporter: Prometheus exporter for performance metrics from Slurm.). These indicate cluster health and availability. For example, DOWN means the node is unreachable or had a failure; DRAINING means it's scheduled to be taken out of service (no new jobs will start, but existing ones can finish); DRAINED means it's out of service; a FAIL state is also used for hardware failure detection (GitHub - vpenso/prometheus-slurm-exporter: Prometheus exporter for performance metrics from Slurm.). The exporter provides counts of nodes in each state (GitHub - vpenso/prometheus-slurm-exporter: Prometheus exporter for performance metrics from Slurm.), which is great for dashboards and alerts (e.g., "more than 5 nodes down or failing" triggers an alert, as that might mean an outage in part of the cluster). Over time, you can also track whether idle nodes are consistently high while jobs are pending, which indicates either scheduling issues or something like fragmentation (jobs can't fit into the available chunks). Also tracking the MIXED state (node partially allocated) and utilization metrics helps see how well-packed the cluster is. A high number of idle cores alongside long queues could mean jobs have big minimum node counts that can't be satisfied, or a partition misconfiguration.
In summary, SLURM's accounting provides a rich source of cluster activity data. Not all of it is ideal for time-series metrics due to volume, but key aggregates (job counts by state, resource utilization, wait times) are. The integration of SLURM and Prometheus (through the exporter) focuses on exposing just such aggregated metrics to avoid the deluge. For example, the "Status of the Jobs" metrics in the exporter give a snapshot of how many jobs are in each state (Pending, Running, Suspended, Completed, etc.) (GitHub - vpenso/prometheus-slurm-exporter: Prometheus exporter for performance metrics from Slurm.), possibly with pending jobs categorized by reason (dependencies, etc.). This complements what the accounting DB would tell you after the fact by showing real-time queue status. An admin can glance at Grafana and see, for instance, 2000 jobs pending (1000 of them waiting on a dependency, 1000 on resources), 500 running, 2 nodes down, and a scheduler that is keeping up (short cycle times). All of these are crucial for managing an HPC facility effectively.
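For reference, the after-the-fact questions that the accounting database answers are usually asked on the command line rather than through Prometheus; for example (the job ID and dates are placeholders):

# Per-job accounting record for a finished job
sacct -j 123456 --format=JobID,User,Account,Partition,State,Elapsed,AllocCPUS,MaxRSS,ExitCode

# Cluster utilization report for a given month
sreport cluster utilization start=2024-01-01 end=2024-02-01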
5. Integration & Co-Deployment of Prometheus and SLURM
Prometheus + SLURM: Metrics Exporter and Grafana Dashboards
By integrating Prometheus monitoring with a SLURM-managed cluster, administrators can get real-time visibility into both the infrastructure and the workload scheduler. The primary integration point is the SLURM exporter for Prometheus. This is a piece of software (often a Python or Go daemon) that queries SLURM (via commands or the Slurm REST API) and exposes metrics for Prometheus. One such exporter is available on GitHub (Prometheus exporter for performance metrics from Slurm. - GitHub) and is also referenced in Grafana dashboards (SLURM Dashboard | Grafana Labs).
SLURM Exporter Metrics: The SLURM Prometheus exporter typically provides metrics on: cluster resource usage, job queue stats, and scheduler performance. For example, it exposes:
Cluster Node States: As discussed, metrics like slurm_nodes_state{state="idle"} = N, state="allocated" = M, etc. In the Grafana "SLURM Dashboard", these drive the State of the Nodes panel, e.g., how many nodes are Allocated, Idle, Down, Drained, etc. over time (GitHub - vpenso/prometheus-slurm-exporter: Prometheus exporter for performance metrics from Slurm.). This helps track overall usage (e.g., whether most nodes are allocated vs idle) and detect problems (nodes in down/drain states).
CPU/GPU Allocation: Metrics for total CPUs allocated vs idle cluster-wide (GitHub - vpenso/prometheus-slurm-exporter: Prometheus exporter for performance metrics from Slurm.), and total GPUs allocated vs total (GitHub - vpenso/prometheus-slurm-exporter: Prometheus exporter for performance metrics from Slurm.). These essentially reflect utilization. For instance, "Allocated CPUs" vs "Idle CPUs" across the cluster (GitHub - vpenso/prometheus-slurm-exporter: Prometheus exporter for performance metrics from Slurm.) can be plotted to see if the cluster is fully busy or if there's spare capacity. In a Grafana screenshot, we might see a line graph of CPUs in use vs total (the example charts "Cluster Nodes" and "State of the CPUs" show these values trending over time) (SLURM Dashboard | Grafana Labs) (GitHub - vpenso/prometheus-slurm-exporter: Prometheus exporter for performance metrics from Slurm.). Similarly for GPUs if present.
Job Counts: The exporter gives counts of jobs in various states (running, pending, completing, etc.) (GitHub - vpenso/prometheus-slurm-exporter: Prometheus exporter for performance metrics from Slurm.). The Grafana "SLURM Jobs" panels likely show a time-series of running vs pending jobs over time (SLURM Dashboard | Grafana Labs). You might see pending jobs spiking when many jobs are submitted, then running jobs increasing as they start. There can also be breakdowns: e.g., Running/Completed jobs vs Failed/Cancelled/Timeout jobs in separate panels for clarity (SLURM Dashboard | Grafana Labs). Completed isn't a state a job stays in, but the dashboard may show how many jobs completed in each interval (or use the increase of a completed counter).
Jobs by User/Account: The exporter specifically lists "Running/Pending/Suspended jobs per SLURM Account" and similarly per user (GitHub - vpenso/prometheus-slurm-exporter: Prometheus exporter for performance metrics from Slurm.). This means it exports metrics like slurm_jobs_running{account="deptA"} = X. This is extremely useful for multi-tenant clusters: you can see which projects are consuming the cluster. Grafana can display the top N users or accounts by running jobs. If one account suddenly has 90% of the jobs, maybe they're dominating, or maybe it's expected because others are idle. This ties into fairness monitoring.
Scheduler Stats: Metrics extracted from sdiag (the scheduler diagnostics): things like the number of scheduler threads, scheduler queue length, last cycle duration, mean cycle time, backfill cycle times, and backfilled job counts (GitHub - vpenso/prometheus-slurm-exporter: Prometheus exporter for performance metrics from Slurm.) (GitHub - vpenso/prometheus-slurm-exporter: Prometheus exporter for performance metrics from Slurm.). In the Grafana dashboard, the "Slurm Scheduler Details" panel shows numeric values like Slurm Scheduler Threads = 5 and Agent Queue Size = 0 (meaning no backlog of messages to the DB), and graphs below show "Scheduler Cycles" and "Backfill Scheduler Cycles" timings. These help ensure the scheduler is healthy. For instance, if the agent queue size (to slurmdbd) is growing, the Grafana notes say this likely indicates SlurmDBD or database issues (GitHub - vpenso/prometheus-slurm-exporter: Prometheus exporter for performance metrics from Slurm.). If the last cycle time is very high (say a scheduling cycle takes 60s), the scheduler is overloaded deciding what to run; one might need to increase the thread count or adjust the configuration.
Accounting Stats: The exporter can also export share information (from sshare), though the default Grafana dashboard might not plot it. It likely provides gauges per account, like raw shares vs usage, but this is less commonly visualized.
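As a simple example of turning such exporter metrics into a dashboard panel, cluster CPU utilization could be computed with a query along these lines (the metric names here are illustrative placeholders in the style above; the real names depend on the exporter you deploy):

# Percentage of cluster CPUs currently allocated to jobs
100 * sum(slurm_cpus_allocated) / sum(slurm_cpus_total)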
Grafana Dashboards: Grafana is commonly used to visualize Prometheus data. There are community dashboards for SLURM (as indicated by Grafana Labs dashboard ID 4323 (SLURM Dashboard | Grafana Labs )). These dashboards typically have rows for:
Cluster Overview: total nodes, nodes by state, total CPUs vs allocated CPUs (perhaps as an area graph Idle vs Allocated vs Other states) (GitHub - vpenso/prometheus-slurm-exporter: Prometheus exporter for performance metrics from Slurm.).
Job Queue: graph of running vs pending jobs, and possibly table of largest pending jobs or something.
Jobs by user: maybe a bar chart of current running jobs by top users.
Scheduler health: graphs of scheduler cycle time, etc., often for admins.
Partition usage: possibly separate panels per partition if cluster is split (e.g., GPU partition usage vs CPU partition).
From the snippet, the Grafana dashboard lists the metrics displayed: State of CPUs/GPUs, State of Nodes, Status of Jobs (with breakdown by Account/User), Scheduler Info, Share Info (SLURM Dashboard | Grafana Labs ). Essentially, exactly those provided by the exporter.
By deploying this, an HPC ops team gets a live view rather than relying solely on command-line tools. For example, instead of running sinfo
and eyeballing, they have a time history of how many nodes were idle over the last week, or how backlog grew during a big submission burst. This can inform if they need more hardware or if scheduling parameters need tuning (like if many nodes idle but jobs pending, maybe scheduling fragmentation issues or job sizes that don't fit current nodes).
Setting up the exporter usually means running it on the SLURM controller node (where it can query slurmctld or run commands like sinfo and squeue). It then listens on a port (say 8080) for Prometheus to scrape. The Prometheus scrape config for it might look like:
scrape_configs:
  - job_name: slurm
    static_configs:
      - targets: ['slurm-master.example.com:8080']
    scrape_interval: 30s
    scrape_timeout: 30s
The exporter documentation even recommends a 30s interval and timeout to avoid overloading slurmctld (GitHub - vpenso/prometheus-slurm-exporter: Prometheus exporter for performance metrics from Slurm.) (GitHub - vpenso/prometheus-slurm-exporter: Prometheus exporter for performance metrics from Slurm.), since some of these queries (like squeue for all jobs) can be heavy. A 30s interval strikes a balance between freshness and load.
In addition to cluster-wide dashboards, one could use Grafana's ad-hoc filters or template variables to drill into specific users or partitions, e.g., a variable to select an account and then show how many jobs that account has had over time.
Alerts with Prometheus: With SLURM metrics in Prometheus, we can set up alerts to catch conditions:
"Too many failed jobs": if the slurm_jobs_failed increase is high, there may be node issues.
"Nodes down": if slurm_nodes_state{state="down"} > 0 for 5 minutes, alert, as a node might have crashed.
"Scheduler not responding": perhaps if the exporter fails to scrape, or if the slurm_scheduler_last_cycle metric hasn't updated (could indicate that slurmctld hung).
"Cluster idle but queue non-empty": if idle CPUs > 10 and pending jobs > 0 for a prolonged time, that suggests jobs aren't being scheduled (could be due to a dependency hold or a misconfiguration).
"High pending jobs per user": maybe one user has > X jobs pending (perhaps indicating a flood of jobs that needs attention, or an inefficiency).
"High utilization": if the cluster is constantly 100% allocated, maybe that's fine (expected in HPC); or alert if usage is consistently low (like under 50% for a day, indicating either a lull or a problem where jobs aren't coming in as expected).
These alerts help operators be proactive. For example, an alert for idle nodes while queue is long could catch if a partition was mistakenly left in a drained state.
Alerting on Failed Jobs, Idle Nodes, and Resource Saturation
Let's consider some concrete alert scenarios combining Prometheus and SLURM knowledge:
Failed Jobs Alert: Not all job failures should alert ops (users often have errors in their jobs), but a spike in system-caused failures should. If a node goes bad, many jobs running there might fail (with NODE_FAIL or ExitCode=... recorded in SLURM). The exporter might provide a slurm_jobs_failed count. We could have an alert like: if the number of FAILED jobs in the last 5 minutes > N, alert. Or specifically if any node has multiple job failures (maybe derived from logs, but metrics could track failure count by node if instrumented). Lacking direct metrics, one could approximate: if a node goes to a DRAIN state because of failures, we catch that via node state metrics. So an alert could be "node in DRAINED (error) state", which often happens after repeated job failures on it.
Idle Nodes with Long Queue: HPC centers want full utilization. If nodes are idle while jobs are waiting, something's off. Perhaps jobs are waiting for specific resources not available on those idle nodes (like GPU jobs waiting while CPU-only nodes sit idle). This might be okay or not. But an alert can identify if a significant portion of the cluster is idle while there is work that could run. For example: if (IdleNodes > 10) and (PendingJobs > 0 for 30m) then alert. This may catch cases where a misconfigured partition is preventing jobs from using idle nodes, or simply notify that the cluster is underutilized (maybe okay overnight, but if persistent, scheduling or capacity may need adjusting).
Excessive Queue Times / Saturation: If the cluster is saturated (100% busy) and lots of jobs are queued, users may experience long wait times. We might alert if the pending jobs count stays very high for a sustained period, or if an oldest-job-waiting-time metric exceeds a threshold. The exporter might not directly give the oldest wait, but one could gauge it by how long a large pending list persists. Frequent backfilling is not itself a saturation signal; a high backfill count is actually good (holes are being utilized). Instead, maybe alert if many high-priority jobs are pending (meaning even top-priority jobs can't start, implying total saturation).
Unbalanced usage / Starvation: Perhaps one account is hogging and others are starved. If fairshare is in effect, that should resolve over time, but maybe alert if some account has 0 running jobs but >100 pending for over a day (indicating they might be starved or put on hold).
Hardware issues: Based on metrics like node state or temperature. If integrated with node exporter, one could alert on CPU temperature or ECC memory errors (if those are exported) per node. Not directly SLURM's job, but part of overall monitoring.
Slurm Daemon Down: If the slurm exporter cannot scrape (Prometheus target down), or if it scrapes but indicates "slurmctld not responding" (some exporters might export a metric like slurm_up, akin to how node exporter has node_scrape_collector_success). We should alert quickly if the scheduler is down, as that means no new jobs will start, which is an outage.
Database issues: If slurmdbd's agent queue (messages to the DB) is growing, as the exporter docs note (GitHub - vpenso/prometheus-slurm-exporter: Prometheus exporter for performance metrics from Slurm.), that could warrant an alert to check the database status.
Creating these alerting rules in Prometheus and routing them through Alertmanager completes the integration: you get pager or email alerts for HPC cluster conditions just as you would for a web service.
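A sketch of what a couple of these might look like as Prometheus alerting rules is shown below; the metric names are illustrative placeholders in the style used above, and the thresholds are arbitrary examples to adapt:

groups:
  - name: slurm-cluster
    rules:
      - alert: SlurmNodesDown
        expr: slurm_nodes_state{state="down"} > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $value }} SLURM node(s) down for more than 5 minutes"

      - alert: SlurmIdleWithBacklog
        # idle CPUs while jobs keep waiting: possible scheduling or partition issue
        expr: (slurm_cpus_idle > 10) and on(instance) (slurm_jobs_pending > 0)
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Idle CPUs alongside a non-empty pending queue for 30 minutes"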
Metric-Log Correlation (Prometheus with Loki/Elasticsearch)
Metrics give numeric overviews, but often we need to drill into logs for details; for instance, if a "node down" alert fires, one will want to see that node's logs around the failure. By correlating Prometheus metrics with logs (via Loki or Elasticsearch), operators can pivot from an alert to the relevant logs easily.
Grafana Loki is a log aggregation system that integrates well with Prometheus. It stores logs with labels (similar to Prometheus labels) and allows queries over log streams. A common practice is to use consistent labels for metrics and logs, so that Grafana's Explore view can link them. For example, label logs with hostname while metrics have instance or nodename. If an alert comes in that "node foo is down", one can filter logs in Loki for {hostname="foo"} in the time range around the alert to see what happened (maybe hardware error messages, a kernel panic, etc.). Grafana even supports linking an exemplar trace or logs to a metric datapoint if you set it up.
Elasticsearch/Kibana could similarly be used; for instance, system logs (syslog, dmesg, slurmctld logs) shipped to Elasticsearch. If a "job failure spike" alert occurs, you'd search the SLURM controller log for error lines (e.g., node communication errors).
To make correlation smoother, some sites integrate via Grafana's Explore: you can have a graph of "failed jobs" in Grafana, click on a spike, and jump to the logs at that timestamp with the relevant context. Loki can use the same label (like job ID or node) if those are included in log messages.
For HPC, interesting logs include:
SLURM controller log (tracks job submissions, completions, errors scheduling).
SLURM node daemons logs (if a node had an issue running a job).
System logs of nodes (hardware errors, etc.).
Application logs of jobs (though usually not aggregated centrally, unless users push them to a system).
If using something like the Elastic Stack, one could index events like job completions as structured logs (via SLURM accounting or a DB hook). But that might duplicate metrics. Instead, focus on things metrics can't tell you: the cause of a failure, performance messages.
One useful approach is linking trace IDs: for microservices, metrics can carry trace exemplars (like a Jaeger trace ID) which Grafana can link to the corresponding trace. In HPC this is not typically applicable unless distributed tracing is used for, say, workflow steps.
But a simpler integration: set up Alertmanager to include links to Kibana/Loki in alerts. E.g., an alert for NodeDown could include a templated link: "See logs: http://lokiserver/grafana/explore?left=...&expr={hostname="$labels[node]"}". This way, when an alert is received, the on-call can click the link to jump to the logs of that node.
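In rule form, that idea amounts to a templated annotation on the alert. The sketch below uses the real up metric for a node exporter target; the URL is a placeholder for your own Grafana/Loki (or Kibana) instance and the hostname log label is assumed:

- alert: NodeExporterDown
  expr: up{job="node_exporter"} == 0
  for: 5m
  annotations:
    summary: "Node {{ $labels.instance }} is not responding to scrapes"
    # placeholder link; note that instance is host:port, so you may need to
    # strip the port or relabel before it matches your log labels
    logs: "https://grafana.example.com/explore?query={hostname=\"{{ $labels.instance }}\"}"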
Prometheus and Loki use case: Suppose an alert "Scheduler Cycle Time > 30s" triggers (meaning slurmctld is having trouble). The admin can check the controller's logs via Loki to find whether any specific job or event is causing slurmctld issues (maybe an error like "error communicating with node X" repeating). The metrics indicated a problem; the logs detail it.
Another example: an alert "High Job Failures" triggers. The alert provides a count and perhaps a sample job ID that failed. The admin can search the slurmctld log for that job ID to see why it failed (the slurmctld log records the job end with an exit code or failure reason), or search the logs of the node that ran the job to see a segfault or OOM message.
In practice, one ensures that job IDs, node names, user IDs, etc., appear in both metrics and logs. SLURM logs do contain job IDs and node names. The exporter metrics label things by user, account, and partition. So you can go from "user X has lots of failing jobs" (metrics) to a log filter on UID X (slurmctld usually logs the user submitting or finishing a job).
Thus, metrics give the "what" and logs give the "why." A tight integration means that in Grafana, within the same interface, you can select Explore on a panel and switch to logs mode, carrying over the relevant labels and time range. This greatly speeds up diagnosis compared to manually opening a separate Kibana and copying timestamps.
Hybrid Clusters: SLURM and Kubernetes Monitoring
Increasingly, organizations may run both HPC workloads via SLURM and containerized workloads via Kubernetes (K8s), sometimes on the same hardware or at least in the same datacenter. This introduces complexity in monitoring because you effectively have two orchestrators. Some strategies and concerns:
Shared Nodes vs Partitioned Nodes: In some setups, a cluster might have a set of nodes managed by SLURM and a separate set managed by K8s (perhaps when not all workloads fit one model). This is like a partitioning of resources. Monitoring-wise, you'd deploy node exporters on all nodes. For K8s nodes, you'd also deploy kube-state-metrics and other K8s exporters; for SLURM nodes, you rely on the SLURM exporter for job info. Grafana can display both: e.g., one panel showing SLURM job usage, another showing K8s pod usage. It might be wise to label metrics with a cluster or system label to distinguish them (the Prometheus job name could be used to separate data sources in queries).
SLURM on Kubernetes (virtual nodes): There is interest (as seen in SchedMD's "Slinky" project (Why Choose Slinky - SchedMD)) in running SLURM inside Kubernetes or vice versa. For example, each K8s pod could represent a SLURM node, or SLURM can allocate resources and then spawn a K8s cluster on them for certain jobs. These are advanced and not yet mainstream, but monitoring them would require bridging metrics. E.g., if a SLURM job spawns a Kubernetes cluster for a workflow (using something like a virtual-kubelet integration), you'd want to monitor that ephemeral K8s cluster. One might use federation: push K8s metrics into the main Prometheus, or run a separate Prometheus and federate.
Unified Monitoring Stack: Ideally, one Prometheus (or a set of them) scrapes both HPC and K8s metrics. The challenge is not to confuse similarly named metrics. For example, both the SLURM exporter and Kubernetes may have a metric called job_running (just hypothetically). If they have distinct names (like slurm_job_running vs kube_job_status), it's fine. Proper job naming in the scrape config, and possibly relabeling metrics with a source label (like adding scheduler="slurm" or scheduler="k8s"), can help.
Use Cases: A hybrid cluster might run long-running services (databases, web services) on Kubernetes, while running MPI jobs on SLURM. Monitoring would involve standard K8s dashboards (pod CPU/mem, etc.) and HPC dashboards (jobs, nodes). The ops team needs to watch both: e.g., if a physical node goes down, it affects both systems. Prometheus can catch that via node_exporter showing the node down (it would produce up{job="node_exporter", instance="node:9100"} == 0). That alert should be unified: one alert triggers, not separate ones for SLURM and K8s.
Opportunity for Cross-System Data: One could feed SLURM info into K8s or vice versa. For example, if the cluster is full of SLURM jobs, maybe don't schedule certain K8s jobs, or spin up cloud nodes. That's more of an automation concern, but it could be driven by Prometheus metrics: e.g., if slurm_nodes_state{state="idle"} = 0 (no free nodes) and a new K8s workload needs to run, a controller might hold it or add nodes. Some advanced sites consider such elastic scenarios.
Visualization: Possibly create a high-level dashboard: overall cluster utilization, taking into account both SLURM and K8s. For instance, "total CPUs used by SLURM + by K8s out of total". One can sum CPU usage from SLURM metrics (basically Allocated CPUs from the slurm exporter) plus CPU usage from K8s (maybe the number of K8s pods times their CPU requests, or actual usage from cAdvisor). But be careful not to double-count if they share nodes: if a node is allocated to SLURM, K8s is likely not running pods there, and vice versa. If they do share nodes concurrently (which is unusual but possible via cgroups), then resource quotas should be enforced to avoid conflict.
From a Prometheus config perspective, you might have multiple scrape jobs: one for node exporter on all nodes, one for the slurm exporter, one for kube-state-metrics, and one for cAdvisor (which might be via the kubelet or through the Prometheus Operator). They can all live in one Prometheus server. Use recording rules to combine them if needed (like sum(rate(node_cpu_seconds_total{mode!="idle", cluster="hpc"}[5m])) to see total CPU usage across all nodes, whether by K8s or HPC jobs; though that just shows OS-level usage, not attribution to SLURM vs K8s).
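One way to keep the two worlds distinguishable in a single Prometheus is to attach a static source label at scrape time, as in the sketch below (target addresses are placeholders; a real Kubernetes setup would more likely use kubernetes_sd_configs than static targets):

scrape_configs:
  - job_name: slurm
    static_configs:
      - targets: ['slurm-master.example.com:8080']
        labels:
          scheduler: slurm
  - job_name: kube-state-metrics
    static_configs:
      - targets: ['kube-state-metrics.monitoring.svc:8080']
        labels:
          scheduler: k8s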
A specific integration example: some sites run Kubeflow or Argo Workflows on top of Kubernetes for machine learning, but use SLURM for big training jobs. They want a unified view of "all ML tasks running". They could label their Prometheus metrics such that they can filter for ML tasks on HPC vs on K8s.
Finally, for maintainers, a hybrid cluster means two schedulers to monitor. Ensuring both are healthy is critical. Prometheus could alert on both slurmctld issues and on Kubernetes API server issues, etc. Having them in one system prevents missing an issue by only looking at one side.
In short, co-deploying Prometheus with both HPC and cloud orchestrators gives a single pane of glass for admins. They can correlate events between systems. For example, maybe a K8s batch job triggers a SLURM job through some bridge; if that pipeline stalls, one can see whether either side had a problem (the K8s part's success metric vs the SLURM job's pending metric). Such holistic monitoring is increasingly valuable as HPC and cloud converge in many environments.
6. Design Challenges and Future Trends
Prometheus at Scale: Cardinality, Long-Term Storage (Thanos/Cortex), OpenTelemetry
As monitoring needs grow (in terms of metrics count, retention, and integration with tracing), new challenges and solutions arise:
Cardinality Limits: We have touched on the dangers of high-cardinality metrics. Prometheus itself can handle on the order of 10 million time series on a beefy server, but performance degrades as series count and query complexity grow. Extreme cases, such as very large Kubernetes clusters or very granular instrumentation, test these limits. Best practices (as discussed) mitigate many cardinality issues by design, but if you truly need to monitor something like per-customer metrics at cloud scale, you may hit a wall with a single Prometheus. The trend is to use sharding or hierarchical federation to scale. Cardinality-management tooling is also emerging: some organizations use automated rules to detect high-cardinality metrics and drop them or warn the responsible developers, and the community has built tools (PromLens and others) to analyze queries and metric cardinality.
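When a server does start to struggle, a standard (if expensive) exploratory query helps identify which metric names carry the most series; run it ad hoc rather than as a recording rule:

    # Top 10 metric names by active series count (heavyweight query; use sparingly)
    topk(10, count by (__name__) ({__name__=~".+"}))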
Long-Term Storage: Prometheus by itself stores data for a local retention period (default 15 days, configurable). For longer retention (months/years) or larger clusters, you need external storage. Two prominent solutions are Thanos and Cortex (High Availability in Prometheus: Best Practices and Tips - Last9) (High Availability in Prometheus: Best Practices and Tips | Last9):
Thanos: Thanos builds on Prometheus by attaching a sidecar to each Prometheus instance that uploads historical block data to object storage (S3, GCS, etc.), and it provides a "querier" component that aggregates queries across multiple Prometheus servers and stores. Essentially, Thanos allows nearly unlimited retention by offloading old data to cheap storage, and enables a global query view if you run multiple Prometheus servers (e.g., one per datacenter). It also deduplicates data if you run Prometheus HA pairs (High Availability in Prometheus: Best Practices and Tips - Last9).
Cortex: Cortex takes a different approach: it is a clustered, horizontally scalable system that uses a microservice architecture (distributors, ingesters, a chunk/object store, queriers) to ingest metrics via Prometheus remote write and store them in a NoSQL or object store. It is effectively a multi-tenant, scalable Prometheus-as-a-service. Cortex can handle very high cardinality by scaling out, at the cost of operational complexity.
Both are CNCF projects. They allow Prometheus's model to scale beyond one server's capacity and to keep data long-term. In an HPC context, one might not need massive multi-tenancy, but long-term trends (cluster utilization year over year, or correlating metrics with configuration changes) can be useful. Storing months of metrics from a 1000-node cluster is heavy but feasible with Thanos, since the data volume is not as high as, say, a web service handling millions of requests per second.
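On the Prometheus side, adopting either usually amounts to a small configuration change. A sketch, assuming a remote-write-compatible backend whose URL here is purely a placeholder:

    # prometheus.yml fragment (the endpoint URL is a placeholder, not a real service)
    remote_write:
      - url: "https://metrics-store.example/api/v1/push"
        queue_config:
          max_samples_per_send: 5000   # tune for network and backend capacity
    # Local retention can then stay short, e.g. Prometheus started with:
    #   --storage.tsdb.retention.time=15d

(The Thanos sidecar model works differently, shipping TSDB blocks to object storage instead of using remote write, but the operational idea is the same: old data leaves the local disk.)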
An advantage of long-term storage: you can perform capacity planning analysis (was our peak usage this year close to capacity?), and post-mortem forensics long after an event (like investigating a failure that happened 6 months ago by looking at metrics around that time).
OpenTelemetry Integration: OpenTelemetry (OTel) is emerging as a standard for telemetry data (traces, metrics, logs). The trend is that, instead of instrumenting specifically for Prometheus, developers instrument using the OTel APIs and then choose an exporter to Prometheus format if needed (OpenTelemetry vs. Prometheus & 5 Tips for Using them Together - Lumigo) (OpenTelemetry vs. Prometheus & 5 Tips for Using them Together - Lumigo). Prometheus and OTel complement each other: the OTel Collector can scrape Prometheus metrics as one of its receivers, process them (e.g., add metadata), and then export to a Prometheus-compatible backend or another system. Conversely, Prometheus can scrape OTel SDKs that expose metrics in the Prometheus/OpenMetrics text format. The convergence means we may see Prometheus evolve to work more seamlessly with OTel, for example by supporting the OTel metric data model (which differs slightly, with explicit histograms and exemplars).
Also, OTel brings traces, i.e. the ability to follow request flows. One interesting crossover is exemplars in Prometheus: Prometheus v2.26+ supports storing and querying exemplars, so a trace ID from a span can be attached to a specific observation of a metric. In Grafana you can then click a spike in latency and jump to a trace that exemplified that spike. In HPC, tracing is typically less granular, but there are efforts to trace HPC workflows or even MPI communication. In the future, a job's execution could be traced (e.g., the phases of a scientific workflow) and tied into metrics like job runtime.
The main idea: OpenTelemetry provides vendor-neutral instrumentation, and you can then funnel metrics to Prometheus or other backends. This offers flexibility: if a site later wants to use a different TSDB, it can do so without re-instrumenting code. For HPC centers adopting cloud technologies, using OTel for new applications while keeping Prometheus for core metrics is a reasonable path forward. Kubernetes, for example, is adding native OTel endpoints. The Prometheus ecosystem acknowledges this and works on interoperability; for instance, there is discussion of the Prometheus server remote-writing to the OTel Collector, and of the OTel Collector acting as a Prometheus remote-write target.
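To make the bridge concrete, here is a rough OpenTelemetry Collector pipeline that scrapes Prometheus-format metrics and forwards them via remote write; the component names come from the upstream Collector distributions (the prometheusremotewrite exporter ships in the contrib build), while the target and endpoint are placeholders:

    # otel-collector configuration (sketch)
    receivers:
      prometheus:
        config:
          scrape_configs:
            - job_name: node_exporter
              static_configs:
                - targets: ['node001:9100']   # placeholder target
    exporters:
      prometheusremotewrite:
        endpoint: "https://metrics-store.example/api/v1/push"   # placeholder backend
    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          exporters: [prometheusremotewrite]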
High Availability of Prometheus: Running Prometheus in HA (two servers scraping the same targets) is a common practice to ensure monitoring is not a single point of failure. However, in vanilla Prometheus those two instances do not know about each other, and each will fire alerts (duplicates). Alertmanager handles the deduplication as long as the alerts share the same alert name and labels, and Thanos or Cortex can deduplicate data at query time. So HA works, but exactly-once alerting is not trivial. Alertmanager HA is easier: you can cluster Alertmanagers. The current best practice is at least two Prometheus servers with identical configuration feeding a pair of Alertmanagers in HA, perhaps fronted by Thanos Query for unified querying. That way, if one Prometheus dies, the other still alerts. This is relevant to HPC as well, since you would not want to lose monitoring during a critical compute run. The Last9 article confirms that running two Prometheus servers with identical configuration is how HA is typically achieved, sometimes behind a load balancer, and mentions "using tools like Thanos or Cortex to deduplicate and aggregate data" (High Availability in Prometheus: Best Practices and Tips | Last9).
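A common shape for this setup, with hostnames and ports that are illustrative only, is two identically configured Prometheus servers pointing at a clustered Alertmanager pair:

    # prometheus.yml (identical on prometheus-1 and prometheus-2)
    alerting:
      alertmanagers:
        - static_configs:
            - targets: ['alertmanager-1:9093', 'alertmanager-2:9093']
    # The Alertmanagers gossip with each other so duplicate alerts are deduplicated,
    # e.g. started as:
    #   alertmanager --cluster.listen-address=0.0.0.0:9094 --cluster.peer=alertmanager-2:9094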
SLURM Futures: Containers, ML Workloads, and Workflow Integration
The HPC scheduling world is also evolving to address new demands:
Container-Native Execution: Traditional HPC jobs run in a bare-metal environment, but increasingly users want to run containerized workloads for portability (e.g., using Singularity/Apptainer or Docker/Shifter). SLURM itself does not "containerize" jobs by default, but it can integrate with container runtimes. For example, it can use Singularity to launch jobs in a container image specified by the user (perhaps via a SPANK plugin that intercepts job launch to exec Singularity). The trend is that HPC centers encourage containers for complex software environments, so the job script may simply call singularity exec my_image.sif my_program. From SLURM's perspective it is just a process, but it uses container technology. SchedMD has been working on making this smoother (the batch script may get an option to specify a container), and newer tools such as enroot or Charliecloud for unprivileged container launch are being integrated. The future likely sees SLURM treating container images as a first-class resource, e.g. pre-staging an image to nodes or mapping GPU drivers seamlessly inside. NERSC's Shifter (an early HPC container runtime) already integrated with SLURM via prolog/epilog scripts.
Monitoring impact: If jobs run in containers, metrics collection may shift; for example, cAdvisor can track usage per container, which could be correlated to per-job usage. SLURM's cgroup names could incorporate job IDs, so node_exporter's cgroup metrics could attribute usage to jobs. And if container orchestrators such as Kubernetes sit on top of SLURM (as some experiments do), the monitoring stacks may combine.
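As a deliberately generic sketch (the image name, program, and resource requests are placeholders), a containerized SLURM job script can be as simple as:

    #!/bin/bash
    #SBATCH --job-name=containerized-sim     # placeholder job name
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=32
    #SBATCH --time=04:00:00

    # Launch all MPI ranks inside an Apptainer/Singularity image (paths are hypothetical)
    srun singularity exec my_image.sif ./my_program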
ML Workloads and GPUs: Machine-learning training jobs are a major use case on modern HPC clusters, often using many GPUs across nodes (distributed training). These jobs have some special characteristics: they can often tolerate being paused (with checkpointing) or run in lower-precision modes. HPC schedulers, including SLURM, are adapting. Examples include support for MPS (Multi-Process Service), which lets multiple smaller jobs share a GPU, and NVIDIA MIG (Multi-Instance GPU), where one physical GPU is split into slices; SLURM's GRES can now schedule MIG partitions. The SLURM exporter exposes GPU utilization metrics, which is important for seeing whether GPUs are fully utilized or sitting idle because jobs are CPU-bound (GitHub - vpenso/prometheus-slurm-exporter: Prometheus exporter for performance metrics from Slurm.).
Also, ML workflows often involve hyperparameter sweeps (many similar jobs, for which job arrays are a great fit) and pipelines of data preprocessing -> training -> postprocessing (handled by SLURM job dependencies or by external workflow tools calling SLURM). We may see more integration with ML orchestrators like Ray or Dask on HPC clusters, either through SLURM backends or side by side. For scheduling, one challenge is gang scheduling (starting all tasks of a distributed job together); SLURM already handles that, since it is essentially how MPI jobs run. But for elastic ML jobs (which can scale worker counts up or down at runtime), HPC schedulers are not yet built for dynamic resizing. Future versions may allow jobs to request more nodes on the fly if available.
Monitoring ML jobs can also involve application-specific metrics (like training loss), which are not the scheduler's job but could be scraped from the job's process. There is interest in combining such metrics with cluster metrics, for example to correlate GPU utilization with training throughput.
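As one hedged sketch on the cluster-metrics side, assuming the GPU nodes also run NVIDIA's dcgm-exporter (whose DCGM_FI_DEV_GPU_UTIL gauge reports per-GPU utilization; the threshold and duration below are arbitrary choices), a rule could flag persistently idle GPUs:

    # alert rule (sketch): GPUs averaging under 10% utilization for half an hour
    groups:
      - name: gpu-utilization
        rules:
          - alert: GPUUnderutilized
            expr: avg by (instance) (DCGM_FI_DEV_GPU_UTIL) < 10
            for: 30m
            labels:
              severity: warning
            annotations:
              summary: "GPUs on {{ $labels.instance }} below 10% utilization for 30m"

A refinement would join this against the SLURM exporter's allocation metrics so it only fires when the idle GPUs are actually allocated to a job.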
Workflow Engines and Integration: Many HPC users rely on workflow engines (Nextflow, Snakemake, Pegasus, Airflow, etc.) to manage complex multi-step computations. These often submit many jobs to SLURM under the hood (each task becomes a SLURM job, or tasks are bundled). Future trends may bring tighter integration: a workflow engine could communicate its job plan to the scheduler, and if the scheduler knows task dependencies and sizes in advance, it can schedule better. There is ongoing research on co-schedulers for this.
From the monitoring side, one might want to monitor at the workflow level, e.g. "Workflow X (composed of 50 jobs) is 30% done". That is a higher abstraction not directly visible to SLURM unless the workflow engine exports metrics (which some can; Nextflow, for instance, can send events to a UI). Using Grafana to combine data, such as tasks completed vs. total tasks (from the workflow engine's perspective) alongside SLURM job states, helps track progress.
Event-Driven and Autoscaling: HPC has historically used static allocation (the cluster has N nodes), but cloud and container orchestration bring dynamic scaling, e.g. adding nodes on demand (cloud bursting) or powering off idle nodes to save energy. SLURM has some support for elastic computing: one can integrate with cloud APIs to add nodes (there is an "elastic computing" guide that uses resume and suspend program scripts, which SLURM calls when it wants to add or remove a node). This is likely to gain traction as HPC spans cloud resources. AWS ParallelCluster, in fact, uses SLURM with such autoscaling: if queue depth is high and no nodes are free, it spins up new EC2 instances to act as SLURM nodes; if nodes idle for a while, it terminates them. This process could also be driven by Prometheus metrics (a custom autoscaler service could read metrics or SLURM's queue via the API and then call the AWS APIs), although AWS already provides its own mechanism.
With Prometheus, one could implement a DIY autoscaler: an Alertmanager webhook that, when an alert like "low idle nodes and high pending jobs" fires, executes a script to provision more nodes (and, conversely, an alert on "many idle nodes" triggers a scale-down after some buffer period). This crosses from monitoring into orchestration, which is common in Kubernetes, where KEDA scales pods based on Prometheus metrics and the cluster-autoscaler adds nodes.
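Wiring that up is mostly an Alertmanager routing question; a fragment like the following (the alert name and webhook URL are placeholders for site-specific pieces) forwards the scale-up alert to a custom service:

    # alertmanager.yml fragment (alert name and webhook URL are placeholders)
    route:
      receiver: default
      routes:
        - match:
            alertname: PendingJobsNoIdleNodes   # hypothetical alert defined in Prometheus rules
          receiver: autoscaler-hook
    receivers:
      - name: default
      - name: autoscaler-hook
        webhook_configs:
          - url: "http://autoscaler.example:8080/scale-up"   # custom service that adds nodes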
Event-driven orchestration refers to responding to events ("data arrived, run a job") instead of running on a schedule. This is more of a workflow matter, but it can also tie into monitoring: a new file triggers an event, which you could detect via a metric (such as a count of files in a directory exported through the node_exporter textfile collector) and use Alertmanager to trigger a job submission (less typical; a separate watcher script is more likely).
AI for Scheduling: A speculative future trend is applying machine learning to scheduling decisions. If metrics show certain patterns (like diurnal load cycles), an AI model could predict them and pre-emptively adjust scheduling policy or resource allocation. That is outside Prometheus's scope, but Prometheus data could feed such models.
Cross-System Resource Management: There is talk of federated scheduling across HPC centers, or between SLURM and Kubernetes. For example, an HPC job might spin up a Kubernetes cluster to do a sub-task (like a data-processing pipeline in Spark). Tools like Volcano (a batch scheduler for K8s) attempt to provide HPC-like scheduling in K8s. It is possible that in the future one system will schedule both container and batch jobs seamlessly, or coordinate between them.
In conclusion, the landscape is converging: HPC workload managers like SLURM are adopting features from cloud orchestrators (dynamic scaling, container support), while cloud systems are learning from HPC (e.g., batch scheduling for GPU jobs). Observability needs to keep up: it is likely we will see unified monitoring frameworks (maybe OTel-based) that cover both HPC and cloud apps. Prometheus and its ecosystem are actively evolving in this direction, with OpenTelemetry integration, scale-out via Thanos/Cortex, and so on, to remain the backbone of monitoring in these hybrid environments.
Finally, academic research continues on scheduling algorithms (e.g., energy-aware scheduling, or scheduling to minimize cloud cost in hybrid clusters). Implementing these could involve new metrics, such as power consumption per job or cost tracking. Indeed, if jobs can run on-premises or in the cloud, a scheduler might pick the cheapest option given a time constraint; one would then monitor budget usage and efficiency.
By staying aware of these trends (containerization, hybrid cloud, advanced scheduling), early-career systems engineers can design monitoring and management solutions that are future-proof and adaptable. Both Prometheus and SLURM are mature but actively developed projects, so new features (like native histograms in Prometheus, Slurm REST API improvements, or new exporter metrics) will continue to appear, enhancing what we can observe and automate in large-scale distributed systems.