Distributed Systems Observability and HPC Workload Management with Prometheus and SLURM (DR)

1. Core Concepts

Monitoring Systems vs. Workload Managers

Monitoring System (Prometheus): A monitoring system continuously collects and stores metrics (measurements) from running services and infrastructure, providing visibility into their performance and health (Putting queues in front of Prometheus for reliability – Robust Perception | Prometheus Monitoring Experts). Prometheus is a prime example – it’s a time-series monitoring system that gathers numeric metrics (CPU usage, request rates, etc.) over time, stores them, and enables querying and alerting on those metrics (Putting queues in front of Prometheus for reliability – Robust Perception | Prometheus Monitoring Experts). Monitoring systems do not execute user workloads; instead, they observe and report on the state of those workloads and the underlying resources.

Workload Manager (SLURM): A workload manager (or batch scheduler) is responsible for scheduling and executing jobs (user-submitted tasks) on a cluster of compute nodes (Slurm Workload Manager - Overview). SLURM (Simple Linux Utility for Resource Management) is an HPC workload manager that allocates resources (nodes/CPUs/GPUs) to jobs, queues jobs until resources are available, and dispatches jobs to run on the cluster (Slurm Workload Manager - Overview). In simple terms, if Prometheus tells you what is happening on your systems, SLURM is one of the systems that makes things happen by running users’ computational jobs. SLURM handles resource arbitration, enforcing policies like job priorities and fairness, whereas Prometheus handles observation, helping administrators detect issues (like high CPU, failed jobs, etc.). Both are crucial in a large cluster: SLURM keeps the machines busy with jobs, and Prometheus ensures you have insight into how those jobs and machines are performing.

Push-Based vs. Pull-Based Metric Collection

Pull Model (Prometheus): Prometheus uses a pull model for metrics – the Prometheus server periodically scrapes each target (application or exporter) by making HTTP requests to fetch the current metric values (Pull doesn't scale - or does it? | Prometheus) (Why is Prometheus Pull-Based? - DEV Community). In this model, the monitoring system initiates the connection. This has several advantages: the monitoring server knows if a target is down (scrape fails), can easily run multiple redundant servers, and one can manually inspect a target’s metrics by visiting its endpoint (Why is Prometheus Pull-Based? - DEV Community). For example, Prometheus can scrape a node exporter on each host every 15 seconds; if a host is unreachable, Prometheus flags it as down immediately. Pull mode also avoids the risk of flooding the server with data – targets emit metrics only when scraped. Prometheus’ designers found that pulling metrics is “slightly better” for these reasons (Why is Prometheus Pull-Based? - DEV Community), although the difference from push is minor for most cases. Prometheus can handle tens of thousands of targets scraping in parallel, and the real bottleneck is typically processing the metric data, not initiating HTTP connections (Pull doesn't scale - or does it? | Prometheus).

Push Model: In a push-based system, the monitored applications send (push) their metrics to a central collector (often via UDP or a gateway). Systems like StatsD, Graphite, or CloudWatch use this model. Push can be useful for short-lived batch jobs or events that can’t be scraped in time. Prometheus accommodates these via an intermediary Pushgateway for ephemeral jobs that terminate too quickly to be scraped (The Architecture of Prometheus. This article explains the Architecture… | by Ju | DevOps.dev) (The Architecture of Prometheus. This article explains the Architecture… | by Ju | DevOps.dev). However, a pure push model requires each application to know where to send data and can overwhelm the collector if misconfigured. Pull systems inherently get a built-in heartbeat (no scrape = problem) (Why is Prometheus Pull-Based? - DEV Community), whereas push systems often need separate health checks. In practice, many monitoring setups use a hybrid: Prometheus mostly pulls, but can pull from a Pushgateway where other apps have pushed their metrics.
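
As a concrete illustration, a short-lived batch job can push a metric to a Pushgateway with a plain HTTP request before it exits; a minimal sketch (the Pushgateway address and metric/job names are placeholders):

# Push one metric for the group "backup_job"; Prometheus later scrapes the Pushgateway
echo "backup_last_success_timestamp_seconds $(date +%s)" \
  | curl --data-binary @- http://pushgateway.example.com:9091/metrics/job/backup_job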

Time-Series Data, Labels, and Alerting Semantics

Modern monitoring metrics are stored as time-series: streams of timestamped values for each metric (Data model | Prometheus). Prometheus’s data model is multi-dimensional: each time-series is identified by a metric name and a set of key-value labels (dimensions) (Data model | Prometheus) (Data model | Prometheus). For example, node_cpu_seconds_total{mode="idle", instance="node1"} is a counter of idle CPU seconds for instance node1. Labels allow slicing and dicing metrics (e.g., aggregate CPU by mode or host) and are fundamental to Prometheus’s query power (Data model | Prometheus). This label-based approach contrasts with older systems that used rigid metric naming conventions – Prometheus’s flexible labels enable powerful ad-hoc queries over dimensions like service, datacenter, instance, etc. However, high-cardinality labels (like per-user IDs) should be avoided as each unique label combination produces a new time-series, exploding data stored (Metric and label naming | Prometheus).
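
To make the label model concrete, here are two PromQL queries over the node_cpu_seconds_total example above, slicing the same data along different dimensions (a sketch; the metric comes from the node exporter):

# Per-instance CPU usage rate, excluding idle time
sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))

# The same data sliced the other way: CPU seconds per second, per mode, across the fleet
sum by (mode) (rate(node_cpu_seconds_total[5m]))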

Prometheus stores recent time-series data locally in its TSDB (Time Series Database) and uses efficient techniques (WAL and compaction) to manage it. New samples append to an in-memory buffer and a write-ahead log (WAL) on disk for durability (Storage | Prometheus). Periodically, samples are compacted into longer-term storage files (blocks), e.g., 2-hour blocks combined into larger blocks up to the retention period (Storage | Prometheus). This design makes writes fast and recoverable (WAL replay on restart) while keeping older data in compressed form. It’s a single-node datastore by design – Prometheus doesn’t cluster its storage (external systems like Thanos/Cortex are used for that), favoring simplicity and reliability of one node (Overview | Prometheus).

Alerting: On top of metric collection, a monitoring system provides alerting semantics. In Prometheus, alerts are defined by alerting rules which continuously evaluate PromQL expressions and fire when conditions are met (Alerting rules | Prometheus). For instance, an alert rule might check if CPU usage >90% for 10 minutes. When the rule condition is true, Prometheus marks an alert “firing” and sends it to the Alertmanager, an external component that de-duplicates alerts, applies silences/inhibitions, and routes notifications (email, PagerDuty, etc.). An alert is essentially a special time-series that becomes active when some metric condition holds (Alerting rules | Prometheus). This approach means alert logic is version-controlled and reproducible (just queries with thresholds), rather than hidden in code. Prometheus’s alerting emphasizes being timely and reliable (it may drop some metric samples under load, but tries hard not to miss firing an alert when an outage happens) (Putting queues in front of Prometheus for reliability – Robust Perception | Prometheus Monitoring Experts). In summary, monitoring systems track metrics over time and raise alerts on abnormal conditions, but they don’t remediate problems themselves. That’s left for humans or automation systems.
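
The "CPU > 90% for 10 minutes" example above might look like the following alerting rule, written in the standard rule-file format (threshold, labels, and group name are illustrative):

groups:
- name: node-alerts
  rules:
  - alert: HighCpuUsage
    # Fires when non-idle CPU exceeds 90% of capacity for 10 minutes
    expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.9
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "CPU usage above 90% on {{ $labels.instance }} for 10 minutes"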

Batch vs. Interactive Jobs, Queues, and Scheduling Policies in HPC

In HPC environments managed by a scheduler like SLURM, users typically run batch jobs. A batch job is a non-interactive workload submitted to a queue, often via a script with resource requirements (e.g., 16 cores for 2 hours). The job waits in a queue until the scheduler allocates the requested resources, then runs to completion. This is in contrast to interactive usage (like running commands on a login node or using an interactive allocation). HPC systems separate these for efficiency and fairness: users submit work to the scheduler instead of directly logging into compute nodes (Scheduling Basics - HPC Wiki) (Scheduling Basics - HPC Wiki). Batch jobs ensure the cluster runs at high utilization with managed scheduling, whereas direct interactive runs could conflict and overload resources.

Interactive jobs in SLURM can be run with commands like srun or salloc which grant a shell or run a command on allocated nodes in real-time. These are useful for debugging or running interactive applications (e.g., an interactive Jupyter notebook on compute nodes). They still go through SLURM’s allocation mechanism (just with immediate execution after allocation). In practice, HPC centers have login nodes for compiling and prepping, but computation happens via batch or interactive jobs on compute nodes that are otherwise not directly accessible (Scheduling Basics - HPC Wiki).
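
For concreteness, a minimal batch script matching the "16 cores for 2 hours" example above, plus an interactive alternative (job name, paths, and time limits are illustrative):

#!/bin/bash
#SBATCH --job-name=sim16
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16     # 16 cores
#SBATCH --time=02:00:00        # 2-hour wall-clock limit
#SBATCH --output=sim16-%j.out  # %j expands to the job ID
srun ./my_simulation --threads 16

# Submitted with:  sbatch job.sh
# Interactive alternative: request an allocation, then run commands on it
#   salloc --cpus-per-task=16 --time=01:00:00
#   srun --pty bash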

Scheduling Policies: SLURM uses a combination of scheduling policies to decide which job to run from the queue when resources free up. By default it is essentially first-in, first-out (FIFO) order, modified by priority factors. Administrators typically configure the multifactor priority plugin, which gives each job a priority score based on factors like waiting time (age), job size, user fair-share, quality-of-service (QoS), etc. (Slurm Workload Manager - Overview). For fairness, SLURM supports fair-share scheduling, where users or projects are assigned shares of the cluster and if someone has used more than their share recently, their new jobs get lower priority (Slurm Workload Manager - Classic Fairshare Algorithm). This prevents one user from monopolizing the cluster – over time the scheduler “banks” usage and biases in favor of those who have run less.

Another important HPC policy is backfill scheduling. Backfilling allows smaller jobs to run out-of-order to avoid wasting idle resources, so long as they don’t delay the start of a higher-priority (often larger) job at the top of the queue (Scheduling Basics - HPC Wiki). The scheduler will reserve nodes for a big job that can’t start yet (maybe waiting for more nodes to free), but in the meantime will backfill shorter jobs into the gaps. This significantly improves utilization: short jobs experience very short queue times since they can slip in, and the large job still starts as soon as its reserved resources become free (Scheduling Basics - HPC Wiki). SLURM’s scheduler can run a backfill cycle periodically to find these opportunities.

SLURM also supports preemption (high-priority jobs can force lower ones off, for example, an urgent job preempts a running job which might be requeued or canceled) and partitions (distinct sets of nodes with their own job queues and limits, analogous to multiple job queues) to enforce different policies. For instance, an “interactive” partition might allow only short jobs for quick turnaround, or a “gpu” partition ensures only GPU-equipped nodes are used for GPU jobs. Priority/Fairness settings along with partitions and QoS give administrators a rich toolset to enforce organizational policies (e.g., research groups get equal share, or certain jobs always have higher priority). The key is that HPC workload managers balance throughput, utilization, and fair access (Scheduling Basics - HPC Wiki): the scheduler tries to minimize wait time, maximize node usage (CPU/GPU not sitting idle), and ensure no single user or project unfairly hogs the cluster.
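
As a rough illustration of how these policies are expressed, a slurm.conf excerpt might combine the multifactor priority plugin, backfill, preemption, and partitions along these lines (a sketch; weights, node ranges, and the preemption mode are illustrative, not recommendations):

# slurm.conf (excerpt)
SchedulerType=sched/backfill          # enable backfill scheduling
PriorityType=priority/multifactor     # multifactor job priority
PriorityWeightAge=1000
PriorityWeightFairshare=10000
PriorityWeightJobSize=1000
PriorityWeightQOS=5000
PreemptType=preempt/qos               # higher QOS can preempt lower QOS
PreemptMode=REQUEUE
PartitionName=interactive Nodes=cn[001-016] MaxTime=04:00:00 Default=NO
PartitionName=gpu         Nodes=gpu[01-08]  MaxTime=2-00:00:00 Default=NO
PartitionName=batch       Nodes=cn[001-512] MaxTime=7-00:00:00 Default=YES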

Control Plane vs. Data Plane in Cluster Systems

In distributed systems, we distinguish the control plane – which makes decisions and orchestrates – from the data plane – which carries out the actual work (Control Plane vs. Data Plane: What’s the Difference? | Kong Inc.). In our context:

SLURM: slurmctld is the control plane, deciding which job runs where and when; the slurmd daemons and the jobs they launch on compute nodes are the data plane doing the actual work.

Prometheus: the Prometheus server, its rule evaluation, and Alertmanager form the observability control plane, while the exporters and instrumented applications on each node are the data plane being observed.

Keeping this separation in mind helps when reasoning about failures: a control-plane outage (slurmctld or Prometheus down) stops new scheduling decisions or alerting, but the data plane (running jobs, node exporters) typically keeps working in the meantime.

2. Architectural Overview

Prometheus Architecture

Prometheus follows a simple but powerful single-server architecture encompassing metric collection, storage, querying, and alerting in one cohesive system (The Architecture of Prometheus. This article explains the Architecture… | by Ju | DevOps.dev). The main components of a Prometheus deployment are:

The Prometheus server: scrapes targets, stores samples in its local TSDB, evaluates recording and alerting rules, and answers PromQL queries over an HTTP API.

Client libraries: used to instrument your own applications so they expose a /metrics endpoint.

Exporters: standalone bridges (node exporter, cAdvisor, the SLURM exporter, etc.) that expose metrics for systems you cannot instrument directly.

Pushgateway: an optional intermediary that holds metrics pushed by short-lived batch jobs until Prometheus scrapes them.

Alertmanager: receives alerts fired by the server, de-duplicates and groups them, and routes notifications.

Visualization: typically Grafana dashboards, plus the built-in expression browser for ad-hoc queries.

A simplified view of Prometheus’s architecture is: the Prometheus server scrapes metrics from instrumented jobs (or exporters), stores those samples locally, then runs rules on that data to produce aggregated series or trigger alerts (Overview | Prometheus). Users can query the data via PromQL (directly via API or through Grafana dashboards). Everything is designed to be self-contained and reliable on a single node (no external dependencies for core operation) (Overview | Prometheus). For scalability, you run multiple Prometheus servers (e.g., sharding by metric kind or environment) or use federation/remote write to hierarchical systems, but each one keeps the same simple internal architecture. Figure 1 illustrates Prometheus’s components and ecosystem at a high level (Prometheus: Monitoring at SoundCloud | SoundCloud Backstage Blog) (with optional components like Pushgateway and Grafana as external integrations).
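
A minimal prometheus.yml wiring these pieces together might look like the following sketch (target addresses and file names are placeholders):

global:
  scrape_interval: 15s
rule_files:
  - "alerts.yml"            # recording/alerting rules evaluated by the server
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager.example.com:9093']
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['node1:9100', 'node2:9100']        # node exporters
  - job_name: pushgateway
    honor_labels: true
    static_configs:
      - targets: ['pushgateway.example.com:9091']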

SLURM Architecture

SLURM is built as a distributed, modular system with a central brain and per-node agents to actually execute jobs (Slurm Workload Manager - Overview). The key components of SLURM’s architecture include:

slurmctld: the central controller daemon that tracks cluster state, holds the job queue, and makes scheduling decisions; an optional backup controller can take over on failure.

slurmd: a lightweight daemon on every compute node that launches job steps, monitors and constrains them, and reports status back to slurmctld.

slurmdbd: an optional accounting daemon, backed by a database, that stores historical job and usage records.

Client commands and APIs: sbatch, srun, salloc, squeue, sinfo, scancel, sacct, scontrol (and the REST API) through which users and administrators interact with the controller.

Figure 2 below (SLURM Components) shows a high-level diagram of this architecture (Slurm Workload Manager - Overview). There is a central slurmctld (and optional backup), numerous slurmd daemons on each compute node, an optional slurmdbd connected to a database for accounting, and various clients (commands or API users) interacting with slurmctld. The design is purposefully decentralized for scalability: the heavy work of running jobs is distributed to the slurmds, while the central daemon focuses on scheduling logic and coordination. This has been proven to scale to clusters of tens of thousands of nodes by minimizing per-job overhead and using efficient RPCs between controller and nodes (Slurm Workload Manager - Overview). The SLURM controller and nodes exchange messages for job launch, completion, and heartbeats. If a node fails, slurmctld notes it and can requeue the node’s jobs or alert the admins. If slurmctld fails, the backup can take over without interrupting running jobs (which slurmd will continue to run and then later report to the new controller).

To summarize, SLURM’s architecture consists of: (1) a central brain scheduling jobs and managing state, (2) distributed agents on each node to execute and report on jobs, (3) an optional accounting database for historical data, and (4) a suite of user-facing commands/APIs to interact with the system (Slurm Workload Manager - Overview) (Slurm Workload Manager - Overview). This modular design (with plugins for different features) allows SLURM to run everything from small clusters to the world’s largest supercomputers by adjusting components and scheduling algorithms as needed.
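
The user-facing commands map directly onto these components; a few standard examples (the job and node IDs are placeholders):

sinfo                      # node/partition state, answered by slurmctld
squeue -u $USER            # queued and running jobs for the current user
scontrol ping              # reports whether the primary/backup slurmctld respond
scontrol show node cn001   # detailed state of one node as seen by slurmctld
sacct -j 12345 --format=JobID,State,Elapsed,MaxRSS   # accounting data via slurmdbd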

3. Observability and Monitoring in Depth

Instrumentation Best Practices (Counters, Gauges, Histograms)

To make systems observable with Prometheus, proper instrumentation is essential. Prometheus client libraries expose four main metric types, each with best-practice usage patterns (Instrumentation | Prometheus):

Counter: a monotonically increasing value (e.g., requests served, jobs completed, errors). Never decrease a counter; query it with rate() to get a per-second rate.

Gauge: a value that can go up and down (e.g., current memory usage, number of running jobs, queue depth).

Histogram: samples observations (e.g., request durations or sizes) into configurable buckets, plus a _sum and _count, enabling quantile estimates with histogram_quantile() and aggregation across instances.

Summary: similar to a histogram but computes quantiles on the client side; those quantiles cannot be aggregated across instances, so histograms are usually preferred for latency SLOs.

Some additional instrumentation best practices from Prometheus maintainers (Instrumentation | Prometheus):

Use base units (seconds, bytes) and descriptive metric names, with a _total suffix for counters (e.g., http_requests_total, node_cpu_seconds_total).

Instrument failure modes (errors, timeouts) alongside the success path, so error ratios can be computed.

Keep labels bounded: label by things like method, status code, or partition, not by user ID, job ID, or other unbounded values.

Prefer instrumenting the service itself (whitebox monitoring) and fall back to exporters only where you cannot modify the code.

By adhering to these practices, you ensure that the data Prometheus collects is both useful and efficient to process. Counters for events, gauges for states, histograms for distributions (especially for request latency or sizes which are critical for SLOs), and mindful labeling will lead to an instrumentation that can scale to millions of series without becoming a headache.
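
In the exposition format that Prometheus scrapes, a well-instrumented counter, gauge, and histogram look roughly like this (metric names and values are illustrative):

# HELP app_requests_total Total HTTP requests handled.
# TYPE app_requests_total counter
app_requests_total{method="get",code="200"} 10234

# HELP app_inflight_requests Requests currently being served.
# TYPE app_inflight_requests gauge
app_inflight_requests 3

# HELP app_request_duration_seconds Request latency.
# TYPE app_request_duration_seconds histogram
app_request_duration_seconds_bucket{le="0.1"} 9000
app_request_duration_seconds_bucket{le="0.5"} 10100
app_request_duration_seconds_bucket{le="+Inf"} 10234
app_request_duration_seconds_sum 1520.7
app_request_duration_seconds_count 10234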

Exporters: Node Exporter, cAdvisor, and Kube-State-Metrics

Prometheus’s rich ecosystem of exporters makes it easy to monitor all parts of your stack, including third-party systems and underlying infrastructure, without modifying them. Some key exporters and integrations relevant to large-scale systems and HPC:

Node exporter: runs on every Linux host and exposes hardware/OS metrics (CPU, memory, disk, network, filesystem) read from /proc and /sys.

cAdvisor: exposes per-container resource usage (CPU, memory, I/O) by tapping into the container runtime; on Kubernetes it is embedded in the kubelet.

kube-state-metrics: exposes the state of Kubernetes objects (deployments, pods, nodes, jobs) by querying the Kubernetes API.

SLURM exporter: queries slurmctld (via commands or the REST API) and exposes cluster, queue, and scheduler metrics, covered in detail in Section 5.

The guiding principle of exporters is that they bridge external systems to Prometheus by translating whatever monitoring interface those systems have (be it CLI commands, APIs, or /proc files) into Prometheus metrics exposition format. This avoids having to modify those systems. For instance, Node exporter reads lots of files from /proc and /sys in Linux to get its metrics, cAdvisor taps into the container runtime, and kube-state-metrics uses the K8s API. As a user of Prometheus, you just point Prometheus at these exporters’ endpoints. This provides a uniform view in Prometheus where everything is a time series with labels, even though under the hood data came from very different sources.
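
Because every exporter speaks the same HTTP exposition format, you can inspect one manually; for example, against a node exporter on its default port 9100 (hostname and sample values are illustrative):

curl -s http://node1:9100/metrics | grep '^node_cpu_seconds_total' | head -3
# node_cpu_seconds_total{cpu="0",mode="idle"} 187359.2
# node_cpu_seconds_total{cpu="0",mode="system"} 1042.6
# ...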

SLIs and SRE Principles (Choosing the Right Metrics)

In Google’s Site Reliability Engineering (SRE) practices, a Service Level Indicator (SLI) is a carefully defined metric that reflects the quality of service – essentially, what your users care about. Typical SLIs include request latency, error rate, throughput, availability, etc. (Google SRE - Defining slo: service level objective meaning). For example, an SLI might be “the fraction of requests that succeed” or “99th percentile response time”. These tie into SLOs (objectives) and SLAs (agreements) which set targets for these indicators (e.g., 99.9% of requests under 200ms). When instrumenting systems, it’s important to capture metrics that can serve as SLIs for your services. Prometheus is often used to gather those metrics.

SRE’s Golden Signals: Google’s SRE book suggests monitoring four key signals for any service: Latency, Traffic, Errors, Saturation. Latency and errors are direct indicators of user experience (how fast? how often failures?). Traffic (e.g., QPS or requests per second) gives context of load. Saturation refers to how utilized the system is (CPU, memory, etc.), indicating how close to limits the service is. Prometheus instrumentation should ideally cover these: e.g., an HTTP service would have a latency histogram (for request duration), a counter for requests (traffic) possibly partitioned by outcome (to compute error rate), and gauges or resource metrics to measure saturation (CPU, memory from node exporter, etc.). Many off-the-shelf instrumentation libraries (such as the Go client’s promhttp middleware, or Spring Boot Actuator/Micrometer in Java) expose such metrics out of the box.

In HPC terms, SLIs could be things like job success rate, scheduling latency (time from job submission to start), or cluster utilization. For instance, if you treat the “cluster” as a service to users, you might have an SLI “job start time within X minutes” or “percentage of nodes available”. However, HPC users often care about throughput (jobs per day) and fairness, which are harder to capture as single SLIs. Still, monitoring things like slurm_job_pending_time_seconds (if such metric is exported per job) could be valuable to see if queue times are rising, indicating potential issues.

Metric design for SLIs: SRE encourages that the metrics you choose should closely measure user-facing outcomes (Google SRE - Defining slo: service level objective meaning) (Google SRE - Defining slo: service level objective meaning). For example, if you run a web portal, an SLI could be the successful page load rate. For an HPC batch system, an SLI for “job scheduling” might be the fraction of jobs started within their expected wait time or the failure rate of jobs due to system errors (not user errors). It’s important to differentiate system reliability from user-caused failures. If a job fails because of a code bug (non-zero exit), that might not count against the service reliability from the SRE perspective (the HPC system was functioning). But if jobs fail due to node crashes or scheduler errors, that is an issue. Thus, one might define an SLI like “% of node-hours available vs total” as an availability measure of the HPC infrastructure.

Using Prometheus, you can implement these SLIs. For example, error rate SLI: define an alert on rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.01 to detect if >1% requests are 5xx (server errors). Or for HPC, monitor the slurm_job_failed_count metrics (if available) to alert if system-wide job failures spike. Google’s SRE also emphasizes aggregation and roll-up: raw metrics need to be aggregated to meaningful levels (e.g., overall success rate vs per-instance). Prometheus’s ability to aggregate by labels helps here – you might compute SLIs via recording rules, e.g., a rule that calculates “cluster_job_success_ratio” by dividing successful jobs by total jobs over some window, excluding user error codes.

Another principle is alert on symptoms, not causes. That means set alerts on the SLI breaches (e.g., “error rate > 5%” or “95th latency > 300ms for 15m”), rather than on intermediate metrics like CPU or memory unless they directly impact service. In an HPC context, a symptom alert might be “job queue wait time above 1 hour for highest priority jobs” as a symptom that something’s wrong (maybe cluster is full or scheduler stuck). The cause could be varied (bad node, etc.), but the symptom catches the user-facing impact.

Service Level Objectives (SLOs) tie into this: if your SLO is “99% of jobs complete successfully”, you’ll watch a metric of successful job completion and ensure it stays above 99%. Prometheus can be used to track SLO compliance over time (e.g., using recording rules to compute rolling success rates). A common pattern is to calculate an “error budget burn rate” – how fast are you consuming the allowable errors – using PromQL, and alert if the burn rate is too high.
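
A sketch of how such SLO tracking might look as Prometheus rules, reusing the http_requests_total error ratio from above (the rule names, window sizes, and thresholds are illustrative; the 14.4 factor follows the common fast-burn heuristic for a 99.9% SLO):

groups:
- name: slo
  rules:
  - record: job:http_error_ratio:rate5m
    expr: |
      sum(rate(http_requests_total{status=~"5.."}[5m]))
        / sum(rate(http_requests_total[5m]))
  - alert: ErrorBudgetBurnTooFast
    # At 14.4x the allowed 0.1% error rate, a monthly budget is gone in roughly two days
    expr: job:http_error_ratio:rate5m > 14.4 * 0.001
    for: 5m
    labels:
      severity: page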

In summary, when designing monitoring for systems (including HPC clusters), identify your critical SLIs – latency, throughput, success rate, utilization, etc. – and ensure those are measurable via metrics. Use counters/histograms for latency and errors (like request durations, job runtime distribution, error counts), use ratios in PromQL to gauge success rates, and use gauges for saturation (like cluster utilization = used_nodes/total_nodes). By following the SRE approach, you focus on metrics that matter to reliability and user happiness, rather than drowning in irrelevant data. And by leveraging Prometheus’s label and aggregation model, you can get both high-level views and detailed breakdowns as needed for troubleshooting.

High-Cardinality Pitfalls and Metric Design Anti-Patterns

As clusters grow and systems become more complex, it’s easy to be tempted to label metrics with very granular dimensions. However, high-cardinality metrics (metrics with a huge number of unique label combinations) can severely impact Prometheus’s performance and memory use (Metric and label naming | Prometheus). Each unique label-set (time-series) consumes memory for tracking and CPU for processing queries. Let’s discuss a few pitfalls and anti-patterns:

Per-entity labels on unbounded sets: labels such as user ID, job ID, request ID, or session ID create a new series for every entity and grow without bound.

Labels derived from input data: putting raw paths, query strings, or error messages into labels has the same effect, and can also leak sensitive data.

One-shot series: metrics that exist only for the lifetime of a short job (e.g., a series per SLURM job) churn the index and bloat the TSDB even though each series holds few samples.

Over-bucketed histograms: a histogram with dozens of buckets, multiplied across many label combinations, multiplies series counts quickly.

Mitigation and Patterns: If you truly need high-cardinality analysis, one approach is to offload it to a logging or tracing system. Prometheus is optimized for aggregated data. If you needed per-job info, maybe you log each job’s runtime to Elasticsearch or a database, and use Kibana or Spark to analyze it. Prometheus could still alert on, say, “job failure count > 0 in 5m”, but not keep a time-series per job. Another approach is to use exemplars (a newer Prometheus feature) where you attach a trace or log ID to a few samples for context, rather than as a persistent label on all samples.

Also, if certain high-cardinality metrics are valuable but heavy, consider shorter retention or relabelling to drop or aggregate labels at scrape time. For example, you might use Prometheus’s drop/keep configurations to ignore metrics that exceed a cardinality threshold or strip certain labels you don’t need.
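
At scrape time this kind of trimming is done with metric_relabel_configs; a sketch that drops an overly detailed label and discards one noisy metric family (the job, target, label, and metric names are illustrative):

scrape_configs:
  - job_name: myapp
    static_configs:
      - targets: ['app1:8080']
    metric_relabel_configs:
      # Remove a high-cardinality label from every scraped series
      - action: labeldrop
        regex: request_id
      # Drop an entire metric family we never query
      - action: drop
        source_labels: [__name__]
        regex: myapp_debug_.*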

In HPC monitoring, one must be especially careful: there are thousands of nodes, hundreds of users, thousands of jobs – metrics involving all three could theoretically generate millions of series (e.g., cpu_usage{node, user, job} is a non-starter!). So the metrics exported by the SLURM exporter and the node exporter deliberately avoid this. They’ll give you node_cpu by node and CPU mode (like idle, system, user), which is a small, fixed set, or slurm_jobs_running{partition} etc. By sticking to aggregated counts (jobs per state, nodes per state, etc.), the SLURM exporter keeps cardinality manageable (GitHub - vpenso/prometheus-slurm-exporter: Prometheus exporter for performance metrics from Slurm.). If you extend monitoring, say you write a script to export GPU process info on each node, be cautious not to include the process ID as a label – better to aggregate like “number_of_gpu_jobs_running_on_node”.

In summary: High-cardinality metrics can stealthily degrade your monitoring system. Always design metrics with questions: Do I need this level of detail? Can I aggregate or bucket this? When in doubt, lean toward fewer, more general metrics and use labels that correspond to stable, bounded sets (service names, roles, categories) rather than unbounded sets (user IDs, timestamps). By doing so, you keep Prometheus healthy and queries fast, which in turn ensures you can reliably alert on and understand your system’s behavior (Metric and label naming | Prometheus).

Advanced PromQL: Rates, Subqueries, and Metric Joins

PromQL’s expressiveness allows complex queries to support deep insights and alerting logic. Here we highlight some advanced uses: calculating rates, using subqueries for rolling windows, and joining metrics with vector matching and label manipulation.
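
For example, a subquery evaluates an inner expression over a rolling window, which is handy for “worst 5-minute rate seen in the last hour” style questions (a sketch using the generic http_requests_total metric):

# Highest 5-minute request rate observed over the past hour, per job
max_over_time(sum by (job) (rate(http_requests_total[5m]))[1h:1m])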

PromQL’s advanced features shine in creating derived metrics and sophisticated alerts. For example, consider an alert: “High CPU load relative to cores available”. You have node_load1 (1-minute load average) and node_cpu_seconds_total (the CPU count can be derived by counting that metric’s per-CPU series, e.g. those with mode="idle", per instance). You might do:

(node_load1{job="node"} / on(instance) count by (instance) (node_cpu_seconds_total{mode="idle"}))
  > 0.8

to detect if load > 80% of the CPU count on a node. Here you’re dividing two vectors whose label sets differ, so they must be matched explicitly on the shared instance label with on(instance). This kind of algebra makes Prometheus more of a “metrics processing language” than just a query filter.

Another advanced alert could be based on absent() function – for example, if a metric from a service hasn’t been seen for 5 minutes, maybe the service is down. absent(up{job="myservice"} == 1) can fire when there are no targets up for that service.

For joining metrics that don’t share labels, sometimes you label one side via recording rule or static config to enable a join. For instance, if you want to attach datacenter info to each node metric, and you have an external file listing node->DC, you could create a gauge metric like node_meta{instance="foo", datacenter="east"} = 1 for each node (perhaps push it via textfile to node exporter). Then in PromQL do node_cpu_seconds_total * on(instance) group_left(datacenter) node_meta to get a version of node_cpu with the datacenter label.

In HPC monitoring, these techniques are useful. For example, join SLURM metrics with node exporter metrics: If SLURM exporter gives slurm_node_state{state="allocated", nodename="foo"} as 1/0, and node exporter has node_energy_joules_total{instance="foo:9100"}, you might align them by some label (though nodename vs instance string differences require label_replace). One could calculate “power usage of allocated nodes vs idle nodes” by joining the two data sets.
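
A sketch of that join, using label_replace to copy the host part of the node exporter’s instance label into a nodename label so the two metric families line up (both metric names here are the hypothetical ones from the example above, not guaranteed exporter names):

# Approximate power draw (W) of nodes currently allocated by SLURM
sum(
    label_replace(rate(node_energy_joules_total[5m]),
                  "nodename", "$1", "instance", "([^:]+):.*")
  * on(nodename) group_left()
    (slurm_node_state{state="allocated"} == 1)
)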

Overall, advanced PromQL allows you to derive insights that raw metrics alone don’t show. It’s often where the monitoring magic happens: defining alerts that combine multiple conditions (e.g., high error rate and high latency -> very bad), or creating summary metrics for dashboards (like cluster efficiency = running cores / total cores, using metrics from Slurm exporter divided by a constant total). Mastering these techniques enables writing precise alerts that reduce noise – for instance, alert on a ratio or trend rather than a single metric threshold. It also helps in capacity planning analysis, anomaly detection, and generating SLO reports from raw data.

4. Resource Scheduling in Depth

SLURM Scheduling Algorithms: FCFS, Fair-Share, Preemption, Backfill, Topology-Aware

SLURM’s scheduling can be tailored, but let’s break down common strategies:

FCFS / priority order: the baseline is first-come, first-served, modified by the multifactor priority plugin (age, fair-share, job size, QoS, partition).

Fair-share: recent usage by a user or account lowers the priority of their new jobs so that shares of the cluster even out over time.

Preemption: higher-priority jobs or QoS levels can suspend, requeue, or cancel lower-priority running jobs.

Backfill: lower-priority short jobs are started out of order in resource gaps, as long as they do not delay the expected start of higher-priority jobs.

Topology-aware placement: the scheduler can prefer allocations that keep a job within one switch or rack, reducing communication latency for tightly coupled MPI jobs.

In operation, SLURM’s scheduler (slurmctld) runs periodically (by default every few seconds) to update priorities and attempt to start jobs. It uses the configured algorithm: sort jobs by priority, try to allocate resources, apply backfill to fill gaps, etc. There is also the concept of a scheduling cycle, and you can get statistics via sdiag (scheduler diagnostics), which shows things like how long each cycle takes and how many jobs were considered/backfilled (GitHub - vpenso/prometheus-slurm-exporter: Prometheus exporter for performance metrics from Slurm.). Tuning may be required for very large systems – e.g., limiting how deep in the queue the backfill algorithm looks, to avoid spending too long computing.

Summary: SLURM’s scheduling takes into account when the job was submitted (Age/Age factor), who submitted it (Fairshare usage), what resources it needs (favoring smaller jobs for backfill or via size factor), what QoS/partition it’s in, and any dependencies or reservations. Preemption and backfill improve utilization and responsiveness for high-priority work. Topology and partitions implement site-specific policies and optimizations. The end goal is to keep the cluster busy (no idle resources unless by policy) while meeting priority rules. Admins have a lot of control: they can define QOS (Quality of Service) levels that bundle priority boosts or preemption ability, and assign jobs to QOS. For example, a “high” QOS could give +1000 priority but limited to a group of users or limited in number of jobs. All these knobs influence the multifactor priority plugin’s outcome (Slurm Workload Manager - Overview).

From a monitoring perspective, understanding these algorithms is important: for example, if jobs are waiting a long time, is it because fairshare is throttling someone? Or is a partition at capacity? Metrics like “pending jobs per partition” or “age of oldest job” can reveal if a scheduler policy is causing a bottleneck. SLURM’s exporter even gives some metrics like scheduling cycle time and backfilled jobs count (GitHub - vpenso/prometheus-slurm-exporter: Prometheus exporter for performance metrics from Slurm.), which can indicate if backfill is actively working or if scheduler is overloaded.

Resource Containers: cgroups and Job Constraints (--mem, --cpus-per-task, --gres)

SLURM ensures that when multiple jobs share a node, they don’t interfere by using Linux cgroups (control groups) to constrain resources per job (Slurm Workload Manager - Control Group in Slurm). Cgroups group a job’s processes and enforce limits on CPU, memory, and more:

CPU: the cpuset/cpu controllers confine a job’s tasks to the cores it was allocated, so extra threads compete only with themselves.

Memory: the memory controller caps resident memory at the job’s request; exceeding it gets the job (or offending step) killed by the OOM mechanism.

Devices: the devices controller restricts access to GPUs and other generic resources so a job only sees the devices it was granted.

Accounting and cleanup: cgroups also let SLURM measure per-job usage accurately and reliably kill every process belonging to a job when it ends.

The net effect is that each job (or in SLURM terms, each “job step” if srun launches sub-tasks) runs in a sandbox of CPU cores, memory, and devices. This is crucial on shared nodes – without it, one job could hog CPU (say run more threads than requested) or consume all memory (causing others to crash). With cgroups, if a job tries to use more CPU than allocated, the scheduler and OS simply won’t allow its threads onto other cores (they’ll be constrained to their cpuset and if fully busy, additional threads just wait). If it tries to use more memory, it gets killed by cgroup limit.

For HPC jobs that require the whole node, cgroups have less impact (the job already has exclusive node). But in many clusters, nodes are shared among smaller jobs, so this isolation is critical.
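
Enforcement is switched on in SLURM’s configuration; a typical sketch looks like the following (exact options and defaults vary by SLURM version):

# slurm.conf (excerpt)
ProctrackType=proctrack/cgroup        # track job processes via cgroups
TaskPlugin=task/cgroup,task/affinity

# cgroup.conf
ConstrainCores=yes        # confine tasks to their allocated CPU cores
ConstrainRAMSpace=yes     # enforce the job's memory request as a hard limit
ConstrainSwapSpace=yes
ConstrainDevices=yes      # restrict access to GPUs/other gres devices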

Constraint Flags: A few important SLURM resource request flags and their meaning (see the example job script below):

--cpus-per-task: number of CPU cores to allocate to each task (for multithreaded programs).

--ntasks / --nodes: number of tasks (e.g., MPI ranks) and the number of nodes to spread them over.

--mem (or --mem-per-cpu): memory per node (or per CPU) to reserve, enforced as a hard limit via the memory cgroup.

--gres: generic resources such as GPUs, e.g., --gres=gpu:2 for two GPUs on each node of the job.

--time: wall-clock limit after which the job is terminated, which also lets the backfill scheduler plan around it.
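
Putting these together, a request for a shared-node GPU job might look like this sketch (values and the script name are illustrative):

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8      # 8 CPU cores for the single task
#SBATCH --mem=32G              # 32 GB of RAM, enforced via the memory cgroup
#SBATCH --gres=gpu:2           # two GPUs (generic resources)
#SBATCH --time=06:00:00
srun ./train.py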

These flags translate to internal resource accounting that the scheduler uses to allocate and the slurmd uses to enforce via cgroups and other OS controls (Slurm Workload Manager - Control Group in Slurm) (Slurm Workload Manager - Control Group in Slurm).

On the monitoring side, one interesting aspect is capturing when cgroup limits are hit. For instance, if jobs frequently get killed for OOM (out-of-memory), one might want to alert. SLURM logs that, and some metrics could be derived (maybe the slurm exporter could count job failures by reason). Also, node exporter can show if processes are being throttled by cgroups (some metrics like container_cpu_cfs_throttled_seconds_total if cAdvisor was used). HPC centers sometimes run Prometheus node exporter with cgroup metrics enabled to see per-cgroup resource usage, but that can be high-cardinality (since cgroup names include job IDs by default). If needed, one could aggregate by state: e.g., show total memory used by all jobs on a node vs total memory.

The bottom line is SLURM leverages Linux kernel features to implement resource containers for jobs, very much akin to what container orchestrators (like Kubernetes) do for pods. In fact, running a job in SLURM is conceptually similar to running a Docker container with CPU/mem limits – both end up creating cgroups. HPC users typically don’t notice cgroups except when they hit a limit (job killed for memory, or using more CPUs does nothing). It provides accounting accuracy too – SLURM can precisely measure a job’s CPU time, memory max, etc., via cgroup controllers (for job accounting records).

Scalability: Job Arrays, Controller Failover, and RPC Architecture

Large HPC systems may have to manage many jobs (millions in backlog) and many nodes. SLURM has several features to handle scale:

Job arrays: a single submission that expands into many related tasks (see the sketch below), which the controller stores and schedules far more cheaply than an equivalent number of independent jobs.

Controller failover: a backup slurmctld can take over from the primary using shared saved state, so running jobs continue uninterrupted during a controller outage.

Efficient RPC fan-out: communication between slurmctld and the slurmd daemons is batched and can be forwarded hierarchically through the nodes, avoiding a thundering herd on the controller.

Scheduler tuning: parameters limit how deep the backfill pass looks and how often full scheduling cycles run, keeping cycle times bounded even with very long queues.
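
A job array in practice, as a minimal sketch (the %50 suffix throttles how many array tasks run concurrently; the script and program names are illustrative):

#!/bin/bash
#SBATCH --array=0-999%50       # 1000 array tasks, at most 50 running at once
#SBATCH --time=00:30:00
srun ./process_chunk --index "${SLURM_ARRAY_TASK_ID}"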

In essence, SLURM’s design goals include being highly scalable (10k+ nodes, 100k+ jobs in queue) (Slurm Workload Manager - Overview). Features like job arrays and hierarchical communication address the scale of jobs and nodes respectively. Most HPC centers push these to their limits (some do millions of jobs per month and track scheduler latency). When monitoring such a system with Prometheus, one could create alerts like “Scheduler cycle time > threshold” from sdiag metrics, to catch if the scheduler is struggling to keep up (which could happen if, say, someone floods 1e6 jobs).

Prometheus itself should be scaled accordingly – scraping a slurm exporter that lists thousands of jobs or nodes is fine (since metrics are aggregated counts, cardinality is not huge). But if one tried to export every job as a metric (bad idea, as discussed), that would hit Prometheus limits.

Accounting and Usage Tracking: slurmdbd and Job States

One of SLURM’s strengths for large clusters is robust accounting: every job’s resource usage and state transitions can be recorded for reporting and enforcing policies:

slurmdbd: the accounting daemon writes job records (user, account, partition, requested and used resources, start/end times, exit code) to a MySQL/MariaDB database.

Associations and limits: the database also holds users, accounts, QoS definitions, and fair-share allocations, which the scheduler consults to enforce usage limits and priorities.

Job states: each job moves through states such as PENDING, RUNNING, COMPLETING, and then a terminal state (COMPLETED, FAILED, CANCELLED, TIMEOUT, NODE_FAIL, PREEMPTED), all of which are recorded.

Reporting: sacct queries individual job records and sreport produces aggregate usage reports per user, account, or cluster (see the examples below).
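
The accounting database is queried through sacct and sreport; for example (user name, date range, and fields are illustrative):

# Per-job records for a user over a date range
sacct -u alice -S 2024-01-01 -E 2024-01-31 \
      --format=JobID,JobName,Partition,State,Elapsed,MaxRSS,ExitCode

# Aggregate cluster utilization report for the same period
sreport cluster Utilization Start=2024-01-01 End=2024-01-31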

In summary, SLURM’s accounting provides a rich source of cluster activity data. Not all of it is ideal for time-series metrics due to volume, but key aggregates (job counts by state, resource utilization, wait times) are. The integration of SLURM and Prometheus (through the exporter) focuses on exposing just such aggregated metrics to avoid the deluge. For example, the “Status of the Jobs” metrics in the exporter give a snapshot of how many jobs are in each state (Pending, Running, Suspended, Completed, etc.) (GitHub - vpenso/prometheus-slurm-exporter: Prometheus exporter for performance metrics from Slurm.), and even categorized pending jobs by reason (dependencies, etc., possibly). This complements what the accounting DB would tell you after the fact by showing real-time queue status. An admin can glance at Grafana and see, for instance, 2000 jobs pending (1000 of them waiting for a dependency, 1000 for resources), 500 running, 2 nodes down, scheduler is keeping up (short cycle times), etc. All these are crucial for managing an HPC facility effectively.

5. Integration & Co-Deployment of Prometheus and SLURM

Prometheus + SLURM: Metrics Exporter and Grafana Dashboards

By integrating Prometheus monitoring with a SLURM-managed cluster, administrators can get real-time visibility into both the infrastructure and the workload scheduler. The primary integration point is the SLURM exporter for Prometheus. This is a standalone service (typically a Go or Python daemon) that queries SLURM (via commands or the Slurm REST API) and exposes metrics for Prometheus. One such exporter is available on GitHub (Prometheus exporter for performance metrics from Slurm. - GitHub) and is also referenced in Grafana dashboards (SLURM Dashboard | Grafana Labs ).

SLURM Exporter Metrics: The SLURM Prometheus exporter typically provides metrics on: cluster resource usage, job queue stats, and scheduler performance. For example, it exposes:

CPU and GPU state: counts of allocated, idle, and other/unavailable CPUs (and GPUs) across the cluster.

Node state: how many nodes are allocated, idle, mixed, draining, or down.

Job queue status: numbers of jobs pending, running, suspended, completed, and failed, with breakdowns by account and user.

Scheduler statistics: data from sdiag such as scheduler cycle times, queue depth, and the number of backfilled jobs.

Fair-share information: per-account share and usage values.

Grafana Dashboards: Grafana is commonly used to visualize Prometheus data. There are community dashboards for SLURM (such as Grafana Labs dashboard ID 4323 (SLURM Dashboard | Grafana Labs )). These dashboards typically have one row per metric group exposed by the exporter.

The dashboard description lists the metric groups displayed: State of CPUs/GPUs, State of Nodes, Status of Jobs (with breakdown by Account/User), Scheduler Info, Share Info (SLURM Dashboard | Grafana Labs ). Essentially, exactly those provided by the exporter.

By deploying this, an HPC ops team gets a live view rather than relying solely on command-line tools. For example, instead of running sinfo and eyeballing, they have a time history of how many nodes were idle over the last week, or how backlog grew during a big submission burst. This can inform if they need more hardware or if scheduling parameters need tuning (like if many nodes idle but jobs pending, maybe scheduling fragmentation issues or job sizes that don't fit current nodes).

Setting up the exporter usually means running it on the SLURM controller node (where it can query slurmctld or run commands like sinfo, squeue). It then listens on a port (say 8080) for Prometheus to scrape. The Prometheus scrape config for it might look like:

scrape_configs:
- job_name: slurm
  static_configs:
    - targets: ['slurm-master.example.com:8080']
  scrape_interval: 30s
  scrape_timeout: 30s

The exporter documentation even recommends a 30s scrape interval and timeout to avoid overloading slurmctld (GitHub - vpenso/prometheus-slurm-exporter: Prometheus exporter for performance metrics from Slurm.), since some of these queries (like squeue for all jobs) can be heavy. A 30s interval strikes a balance between freshness and load.

In addition to cluster-wide dashboards, one could integrate with Grafana’s ad-hoc filters or variables to drill into specific users or partitions. E.g., a variable to select an account and then show how many jobs that account has had over time.

Alerts with Prometheus: With SLURM metrics in Prometheus, we can set up alerts to catch conditions such as: nodes going down or being drained unexpectedly; many nodes sitting idle while a long queue of jobs is pending; a spike in failed jobs; the pending queue (or the age of the oldest pending job) growing beyond a threshold; and scheduler cycle times rising, indicating slurmctld is struggling. A couple of these are sketched as alerting rules below.
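
A sketch of two such rules. The metric names follow the vpenso slurm-exporter naming but should be verified against the exporter version you actually deploy; thresholds are illustrative:

groups:
- name: slurm-alerts
  rules:
  - alert: SlurmNodesDown
    expr: slurm_nodes_down > 0            # exporter metric name, verify locally
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "{{ $value }} SLURM nodes reported down for 10+ minutes"
  - alert: SlurmIdleNodesWithBacklog
    # Nodes sit idle while a large queue of jobs is pending
    expr: slurm_nodes_idle > 10 and slurm_queue_pending > 500
    for: 30m
    labels:
      severity: warning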

These alerts help operators be proactive. For example, an alert for idle nodes while queue is long could catch if a partition was mistakenly left in a drained state.

Alerting on Failed Jobs, Idle Nodes, and Resource Saturation

Let’s consider some concrete alert scenarios combining Prometheus and SLURM knowledge:

Failed jobs: a sudden spike in jobs ending in FAILED or NODE_FAIL states (beyond the normal background of user errors) often points to a bad node, a full filesystem, or a scheduler problem, and is worth an alert.

Idle nodes with a backlog: nodes sitting idle or drained while hundreds of jobs are pending suggests a partition misconfiguration, fragmentation, or nodes stuck in a drained state.

Resource saturation: cluster-wide allocated CPUs or GPUs near 100% for a sustained period, or node exporter metrics showing memory or filesystem pressure on shared nodes, indicates the cluster is saturated and queue times will grow.

Defining these alert rules in Prometheus and routing them through Alertmanager completes the integration: you get pager or email alerts for HPC cluster conditions just as you would for a web service.

Metric-Log Correlation (Prometheus with Loki/Elasticsearch)

Metrics give numeric overviews, but often we need to drill into logs for details – for instance, if an alert “node down” fires, one will want to see that node’s logs around the failure. By correlating Prometheus metrics with logs (via Loki or Elastic), operators can pivot from an alert to relevant logs easily.

Grafana Loki is a log aggregation system that integrates well with Prometheus. It stores logs with labels (similar to Prom labels) and allows queries for log streams. A common practice is to use consistent labels for metrics and logs, so that Grafana’s “Explore” can link them. For example, label logs with hostname and metrics also have instance or nodename. If an alert comes in “node foo is down”, one can filter logs in Loki for {hostname="foo"} and time range around the alert to see what happened (maybe hardware error messages, kernel panic, etc.). Grafana even supports linking an exemplar trace or logs to a metric datapoint if you set it up.

Elasticsearch/Kibana similarly could be used – for instance, system logs (syslog, dmesg, slurmctld logs) shipped to Elastic. If an alert “Job failure spike” occurs, you’d search the SLURM controller log for error lines (e.g., node communication errors).

To make correlation smoother, some integrate via Grafana’s explore: you can have a graph of “Failed jobs” in Grafana, click on a data point spike, and jump to logs at that timestamp with the relevant context. Loki can use the same label (like job ID or node) if those are included in log messages.

For HPC, interesting logs include:

slurmctld logs on the controller (scheduling decisions, node state changes, job submission/completion records, communication errors).

slurmd logs on each compute node (job launch and termination, cgroup/OOM kills, prolog/epilog failures).

System logs and dmesg on compute nodes (hardware errors, kernel panics, OOM killer activity, filesystem or network faults).

Job stdout/stderr files, which usually hold the application-level reason a job failed.

If using something like Elastic Stack, one could index events like job completion as structured logs (via Slurm accounting or a DB hook). But that might duplicate metrics. Instead, focus on things metrics can’t tell: cause of failure, performance messages.

One cool approach is linking trace IDs: for microservices, instrumented metrics can carry exemplars (for example a Jaeger trace ID) that Grafana can link to the corresponding trace. In HPC this is not typically applicable unless you use distributed tracing for, say, workflow steps.

But a simpler integration: include links to Kibana/Loki in your alert annotations, which Alertmanager then renders into notifications. E.g., an alert for NodeDown could include a templated link: “See logs: http://lokiserver/grafana/explore?left=...&expr={hostname="{{ $labels.node }}"}”. This way, when an alert is received, the on-call can click the link to jump to logs of that node.

Prometheus and Loki use-case: Suppose an alert “Scheduler Cycle Time > 30s” triggers (meaning slurmctld is having trouble). The admin can check logs on the controller via Loki to find if any specific job or event is causing slurmctld issues (maybe an error like “error communicating with node X” repeating). The metrics indicated a problem, the logs detail it.

Another example: An alert “High Job Failures” triggers. The alert provides maybe a count and perhaps a sample job ID that failed. The admin can search in slurmctld log for that job ID to see why it failed (slurmctld log records job end with exit code or failure reason). Or search node logs for that job’s execution node to see a segfault or OOM message.

In integration practice, one ensures that job IDs, node names, user IDs, etc., appear in both metrics and logs. SLURM logs do contain job IDs and node names. The exporter metrics label things by user, account, partition. So you can go from “user X has lots of failing jobs” (metrics) to a log query filtered on that user (slurmctld usually logs the submitting user when a job starts and finishes).

Thus, metrics give the “what” and logs give the “why.” A tight integration means in Grafana, within the same interface, you can select “Explore” on a panel and switch to logs mode, carrying over relevant labels/time. This greatly speeds up diagnosis compared to manually opening a separate Kibana and copying times.

Hybrid Clusters: SLURM and Kubernetes Monitoring

Increasingly, organizations may run both HPC workloads via SLURM and containerized workloads via Kubernetes (K8s), sometimes on the same hardware or at least in the same datacenter. This introduces complexity in monitoring because you effectively have two orchestrators. Some strategies and concerns:

Run the node exporter everywhere so that low-level host metrics are comparable across the HPC and Kubernetes fleets.

Scrape each orchestrator’s own metrics (the SLURM exporter on one side; kube-state-metrics, kubelet/cAdvisor, and the API server on the other) from the same Prometheus or from federated Prometheus servers.

Use consistent labels (e.g., a cluster or role label) so dashboards and alerts can be filtered or unified across both worlds.

Watch for contention if the two schedulers share hardware, since neither is aware of the other’s allocations.

From a Prometheus config perspective, you might have multiple scrape jobs: one for node exporter on all nodes, one for slurm exporter, one for kube-state-metrics, one for cadvisor (which might be via the kubelet or through Prom operator). They can all live in one Prom server. Use recording rules to combine if needed (like sum(node_cpu_seconds_total{mode!="idle", cluster="hpc"}) to see total CPU usage across all nodes, whether by K8s or HPC jobs – though that just shows OS usage, not attribution to SLURM vs K8s).
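
A sketch of such a combined scrape configuration, with an external label distinguishing this Prometheus from others (addresses are placeholders; in a real Kubernetes setup the kubelet/cAdvisor and kube-state-metrics targets would usually come from kubernetes_sd_configs rather than static lists):

global:
  external_labels:
    cluster: hpc-datacenter-1
scrape_configs:
  - job_name: node            # node exporter on every node, HPC and K8s alike
    static_configs:
      - targets: ['cn001:9100', 'cn002:9100', 'k8s-worker1:9100']
  - job_name: slurm
    static_configs:
      - targets: ['slurm-master.example.com:8080']
  - job_name: kube-state-metrics
    static_configs:
      - targets: ['kube-state-metrics.kube-system.svc:8080']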

A specific integration example: some sites run Kubeflow or Argo Workflows on top of Kubernetes for machine learning, but use SLURM for big training jobs. They want a unified view of “all ML tasks running”. They could label their Prom metrics such that they can filter “show me only ML tasks on HPC vs on K8s”.

Finally, for maintainers, a hybrid cluster means two schedulers to monitor. Ensuring both are healthy is critical. Prometheus could alert on both slurmctld issues and on Kubernetes API server issues, etc. Having them in one system prevents missing an issue by only looking at one side.

In short, co-deploying Prometheus with both HPC and cloud orchestrators gives a single pane of glass for admins. They can correlate events between systems. For example, maybe a K8s batch job triggers a SLURM job through some bridge – if that pipeline stalls, one can see if either side had a problem (K8s part success metric vs slurm job pending metrics). Such holistic monitoring is increasingly valuable as HPC and cloud converge in many environments.

Prometheus at Scale: Cardinality, Long-Term Storage (Thanos/Cortex), OpenTelemetry

As monitoring needs grow (in terms of metrics count, retention, and integration with tracing), new challenges and solutions arise:

Cardinality growth: more services, nodes, and labels mean more active series, so the design discipline from Section 3 matters even more at scale.

Long-term storage and global view: a single Prometheus keeps limited local retention, so systems like Thanos and Cortex/Mimir add durable object-storage-backed retention, deduplication across replicas, and a single query endpoint over many Prometheus servers, typically fed via remote_write or a sidecar (see the sketch below).

OpenTelemetry: OTel is emerging as a vendor-neutral standard for metrics, traces, and logs; newer Prometheus versions can ingest OTLP metrics, and OTel collectors can scrape or forward Prometheus metrics, easing integration between the two ecosystems.
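
For the long-term-storage piece, the usual pattern is remote_write from each Prometheus to a backend such as Thanos Receive, Cortex, or Mimir; a minimal sketch (the URL and queue setting are placeholders):

remote_write:
  - url: https://metrics-gateway.example.com/api/v1/push
    queue_config:
      max_samples_per_send: 5000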

SLURM Futures: Containers, ML Workloads, and Workflow Integration

The HPC scheduling world is also evolving to address new demands:

Containers: SLURM can launch containerized jobs (e.g., via container plugins or runtimes such as Singularity/Apptainer), giving users reproducible software environments without giving up batch scheduling.

ML workloads: GPU-heavy training jobs push demand for fine-grained GPU scheduling, gres accounting, and better visibility into GPU utilization per job.

Workflow integration: workflow engines and pipelines increasingly submit to SLURM programmatically (often through the REST API), blurring the line between batch scheduling and service orchestration.

Cloud bursting and dynamic nodes: clusters can grow and shrink by powering nodes up and down or provisioning cloud instances on demand, which the scheduler must account for.

In conclusion, the landscape is converging: HPC workload managers like SLURM are adopting features from cloud orchestrators (dynamic scaling, container support), while cloud systems are learning from HPC (e.g., batch scheduling for GPU jobs). Observability needs to keep up: it’s likely we’ll see unified monitoring frameworks (maybe OTel-based) that cover both HPC and cloud apps. Prometheus and its ecosystem are actively evolving in this direction: integration with OpenTelemetry, scaling out via Thanos/Cortex, etc., to remain the backbone of monitoring in these hybrid environments.

Finally, academic research continues on scheduling algorithms (e.g., energy-aware scheduling, or scheduling to minimize cloud cost in hybrid clusters). Implementing these could involve new metrics – like power consumption per job, or cost tracking. Indeed, if jobs can run on-prem or cloud, a scheduler might pick the cheapest option given a time constraint. Then one would monitor budget usage and efficiency.

By staying aware of these trends – containerization, hybrid cloud, advanced scheduling – early-career systems engineers can design monitoring and management solutions that are future-proof and adaptable. Both Prometheus and SLURM are mature but actively developed projects, so new features (like native histograms in Prometheus or Slurm REST API improvements or new exporter metrics) will continue to appear, enhancing what we can observe and automate in large-scale distributed systems.