Cluster Monitoring with Slurm Native OpenMetrics

TL;DR

Starting with Slurm 25.11, slurmctld directly exposes metrics in OpenMetrics format (the Prometheus metric standard). You no longer need to install a separate binary like prometheus-slurm-exporter that parses squeue/sinfo output. Just add one line to slurm.conf and write a Prometheus scrape config.

In this post, I share how I set up this feature on a real multi-GPU cluster. I also published the Grafana dashboard I built to the marketplace.


The Problem with the Old Approach

Until now, adding Prometheus monitoring to a Slurm cluster required this architecture:

slurmctld → squeue/sinfo/sdiag → prometheus-slurm-exporter(:9341) → Prometheus → Grafana

A separate exporter binary like vpenso/prometheus-slurm-exporter would periodically run CLI commands such as squeue, sinfo, and sdiag, parse their output, and convert it into Prometheus metrics. It worked, but came with several pain points:

  • Managing an extra binary: You had to build or download the exporter separately and register it as a systemd service.
  • CLI parsing fragility: If the squeue output format changed, the exporter could break.
  • Metric accuracy: Because the exporter polls CLI commands on an interval, its view of the cluster lags behind slurmctld's actual state.
  • Maintenance burden: Every time Slurm was upgraded, you had to verify exporter compatibility.

In practice, the canonical exporter prometheus-slurm-exporter went unmaintained for a long time, leading to a proliferation of forks such as rivosinc/prometheus-slurm-exporter and SckyzO/slurm_exporter.

What Changed in 25.11

Starting with Slurm 25.11, slurmctld exposes metrics conforming to the OpenMetrics 1.0 spec directly via an HTTP endpoint:

slurmctld(:6817) /metrics/* → Prometheus → Grafana

The exporter is gone. slurmctld exposes its own internal state directly. Because it exports internal data rather than parsing CLI output, the metrics are accurate and fast.

With official native metrics now available, the need for community exporters will naturally diminish. (Slurm 25.11 Release Notes, Metrics Guide)

Configuration

1. slurm.conf

Only one line needs to be added:

MetricsType=metrics/openmetrics

Restart slurmctld after the change:

systemctl restart slurmctld

slurmctld will now serve metrics over HTTP on its default port (6817).

Caveats

  • Metrics are disabled if PrivateData is set. If your slurm.conf contains a PrivateData parameter, the metrics feature will not work.
  • There is no authentication. Anyone who can send an HTTP request to the slurmctld port can access the metrics, so firewall or network access controls are required.
  • Be mindful of scrape frequency. Metric requests acquire an internal lock inside slurmctld, so scraping too frequently can impact scheduler performance. The official documentation recommends a 60–120 second interval (Metrics Guide).
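Note that 6817 is also slurmctld's regular RPC port, used by srun, sbatch, and slurmd, so you cannot simply block it. One option is to allow only the cluster subnet plus the Prometheus host. A minimal nftables sketch, where 10.0.0.0/24 (cluster subnet) and 10.1.0.5 (Prometheus host) are assumed addresses you would substitute with your own:

```
# Restrict slurmctld's port to known clients.
# 10.0.0.0/24 and 10.1.0.5 are placeholder addresses.
table inet slurm_metrics {
  chain input {
    type filter hook input priority 0; policy accept;
    tcp dport 6817 ip saddr { 10.0.0.0/24, 10.1.0.5 } accept
    tcp dport 6817 drop
  }
}
```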

2. Verifying the Endpoints

Once configured, you can verify immediately with curl:

# List available endpoints
curl http://localhost:6817/metrics

# Job status
curl http://localhost:6817/metrics/jobs

# Node resources
curl http://localhost:6817/metrics/nodes

# Partition status
curl http://localhost:6817/metrics/partitions

# Scheduler performance
curl http://localhost:6817/metrics/scheduler

# Per-user/account jobs
curl http://localhost:6817/metrics/jobs-users-accts

Example response (/metrics/nodes):

# HELP slurm_node_cpus Total number of cpus in the node
# TYPE slurm_node_cpus gauge
slurm_node_cpus{node="gpu-node-01"} 64
slurm_node_cpus{node="gpu-node-02"} 128
slurm_node_cpus{node="cpu-node-01"} 48
# HELP slurm_node_cpus_alloc Allocated cpus in the node
# TYPE slurm_node_cpus_alloc gauge
slurm_node_cpus_alloc{node="gpu-node-01"} 0
slurm_node_cpus_alloc{node="gpu-node-02"} 32
# EOF
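The text format is simple enough to aggregate by hand when sanity-checking a scrape outside Prometheus. A minimal Python sketch that sums the sample response above (the sum_metric helper is my own, not part of any Slurm tooling; in practice you would fetch the body over HTTP instead of inlining it):

```python
# Compute cluster-wide CPU utilization from a /metrics/nodes response.
SAMPLE = """\
# HELP slurm_node_cpus Total number of cpus in the node
# TYPE slurm_node_cpus gauge
slurm_node_cpus{node="gpu-node-01"} 64
slurm_node_cpus{node="gpu-node-02"} 128
slurm_node_cpus{node="cpu-node-01"} 48
# HELP slurm_node_cpus_alloc Allocated cpus in the node
# TYPE slurm_node_cpus_alloc gauge
slurm_node_cpus_alloc{node="gpu-node-01"} 0
slurm_node_cpus_alloc{node="gpu-node-02"} 32
# EOF
"""

def sum_metric(text: str, name: str) -> float:
    """Sum every sample of one metric across all of its label sets."""
    total = 0.0
    for line in text.splitlines():
        # A sample line is either `name{labels} value` or `name value`.
        if line.startswith(name + "{") or line.startswith(name + " "):
            total += float(line.rsplit(" ", 1)[1])
    return total

total = sum_metric(SAMPLE, "slurm_node_cpus")        # 240.0
alloc = sum_metric(SAMPLE, "slurm_node_cpus_alloc")  # 32.0
print(f"CPU utilization: {alloc / total:.1%}")       # → CPU utilization: 13.3%
```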

3. Health Check Endpoints (Bonus)

25.11 also adds HTTP health check endpoints in addition to metrics:

# slurmctld
curl http://localhost:6817/livez    # Process liveness check
curl http://localhost:6817/readyz   # Ready to handle requests
curl http://localhost:6817/healthz  # Overall health status

# slurmd (port 6818 on each node)
curl http://<node-ip>:6818/livez

Combined with the Blackbox Exporter, you can monitor the status of Slurm daemons directly from Prometheus.
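Probing /livez through the Blackbox Exporter looks roughly like this in prometheus.yml. This is a sketch that assumes the exporter runs at <blackbox-host>:9115 with its default http_2xx module:

```yaml
scrape_configs:
  - job_name: "slurm-health"
    metrics_path: /probe
    params:
      module: [http_2xx]   # expect an HTTP 2xx response
    static_configs:
      - targets:
          - http://<slurmctld-host>:6817/livez
          - http://<node-ip>:6818/livez
    relabel_configs:
      # Standard blackbox pattern: move the probed URL into ?target=
      # and point the actual scrape at the exporter itself.
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: <blackbox-host>:9115
```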

4. Prometheus Scrape Config

Add a scrape job for each endpoint in prometheus.yml. Separating by endpoint lets you tune the interval and timeout independently as needed:

scrape_configs:
  # Job status (running, pending, completed, etc.)
  - job_name: "slurm-native-jobs"
    scrape_interval: 60s
    metrics_path: "/metrics/jobs"
    static_configs:
      - targets: ["<slurmctld-host>:6817"]

  # Node resources (CPU, memory allocation)
  - job_name: "slurm-native-nodes"
    scrape_interval: 60s
    metrics_path: "/metrics/nodes"
    static_configs:
      - targets: ["<slurmctld-host>:6817"]

  # Per-partition status
  - job_name: "slurm-native-partitions"
    scrape_interval: 60s
    metrics_path: "/metrics/partitions"
    static_configs:
      - targets: ["<slurmctld-host>:6817"]

  # Scheduler performance (backfill, cycle time, etc.)
  - job_name: "slurm-native-scheduler"
    scrape_interval: 60s
    metrics_path: "/metrics/scheduler"
    static_configs:
      - targets: ["<slurmctld-host>:6817"]

  # Per-user/account jobs
  - job_name: "slurm-native-users-accts"
    scrape_interval: 60s
    metrics_path: "/metrics/jobs-users-accts"
    static_configs:
      - targets: ["<slurmctld-host>:6817"]

Note: The /metrics/jobs-users-accts endpoint generates time series proportional to the number of users. On clusters with many users, use a generous scrape interval.
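On such clusters, one way to keep dashboards cheap is to pre-aggregate the high-cardinality series with Prometheus recording rules. A sketch (the rule names follow the usual level:metric:operation convention but are my own choice):

```yaml
groups:
  - name: slurm_user_aggregates
    rules:
      # Evaluated once per rule cycle, so dashboards don't have to
      # sum thousands of per-user series on every panel refresh.
      - record: slurm:user_jobs_cpus_alloc:sum
        expr: sum(slurm_user_jobs_cpus_alloc)
      - record: slurm:user_jobs_running:sum
        expr: sum(slurm_user_jobs_running)
```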

Reload the Prometheus configuration:

# Method 1: HTTP API
curl -X POST http://localhost:9090/-/reload

# Method 2: Signal
kill -HUP $(pidof prometheus)

Verify that all targets show UP status in the Prometheus UI (http://localhost:9090/targets).

Complete Metrics Reference

/metrics/jobs — Cluster-wide Job Status

Metric                     Description
slurm_jobs_running         Number of running jobs
slurm_jobs_pending         Number of pending jobs
slurm_jobs_completed       Number of completed jobs
slurm_jobs_failed          Number of failed jobs
slurm_jobs_cpus_alloc      Total allocated CPUs
slurm_jobs_memory_alloc    Total allocated memory
slurm_jobs_nodes_alloc     Total allocated nodes
slurm_jobs_timeout         Number of timed-out jobs
slurm_jobs_outofmemory     Number of jobs failed due to OOM

Beyond these, all Slurm job states are exposed as metrics, including cancelled, completing, configuring, suspended, and preempted.

/metrics/nodes — Per-node Resources

Metric                                     Label   Description
slurm_node_cpus{node="..."}                node    Total node CPUs
slurm_node_cpus_alloc{node="..."}          node    Allocated CPUs
slurm_node_cpus_idle{node="..."}           node    Idle CPUs
slurm_node_memory_bytes{node="..."}        node    Total memory
slurm_node_memory_alloc_bytes{node="..."}  node    Allocated memory
slurm_node_memory_free_bytes{node="..."}   node    Free memory
slurm_nodes_idle                                   Number of idle nodes
slurm_nodes_mixed                                  Number of mixed nodes
slurm_nodes_alloc                                  Number of allocated nodes
slurm_nodes_down                                   Number of down nodes
slurm_nodes_drain                                  Number of draining nodes
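These node-state gauges lend themselves to simple alerting. A sketch of a Prometheus alerting rule (the rule name, threshold, and severity label are my own choices, not from the Slurm docs):

```yaml
groups:
  - name: slurm_node_alerts
    rules:
      - alert: SlurmNodesDown
        expr: slurm_nodes_down > 0
        for: 5m                      # ignore brief flaps
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }} Slurm node(s) down"
```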

/metrics/partitions — Per-partition Status

Node and job metrics are provided at partition granularity with a {partition="..."} label. For example:

  • slurm_partition_jobs_running{partition="batch"} — running jobs in the batch partition
  • slurm_partition_nodes_cpus_alloc{partition="batch"} — allocated CPUs in the batch partition
  • slurm_partition_nodes_mem_alloc{partition="batch"} — allocated memory in the batch partition

/metrics/scheduler — Internal Scheduler Performance

Metric                      Description
slurm_sched_mean_cycle      Main scheduler average cycle time (µs)
slurm_bf_mean_cycle         Backfill scheduler average cycle time (µs)
slurm_bf_cycle_last         Last backfill cycle time
slurm_bf_depth_mean         Average backfill search depth
slurm_bf_queue_len          Backfill queue length
slurm_schedule_queue_len    Scheduler queue length
slurm_slurmdbd_queue_size   slurmdbd queue size
slurm_backfilled_jobs       Total backfilled jobs (cumulative)
slurm_sdiag_latency         RPC response latency
slurm_server_thread_cnt     Active slurmctld thread count

/metrics/jobs-users-accts — Per-user/Account

Job metrics broken down by user ({user="..."}) and account ({account="..."}):

  • slurm_user_jobs_running{user="alice"} — alice’s running jobs
  • slurm_user_jobs_cpus_alloc{user="alice"} — CPUs in use by alice
  • slurm_account_jobs_pending{account="default"} — pending jobs for the default account

This endpoint is useful for tracking per-user resource usage or monitoring fairshare distribution.
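For example (PromQL sketches against the labels above; topk and sum by are standard PromQL operators):

```
# Top 5 users by allocated CPUs
topk(5, slurm_user_jobs_cpus_alloc)

# Pending jobs aggregated per account
sum by (account) (slurm_account_jobs_pending)
```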

Metric Name Comparison with the Legacy Exporter

If you were using prometheus-slurm-exporter before, note that metric names are completely different. Existing Grafana dashboards cannot be reused as-is; queries must be rewritten.

Legacy Exporter                               25.11 Native                   Description
slurm_cpus_alloc                              slurm_jobs_cpus_alloc          Total allocated CPUs
slurm_cpus_idle                               sum(slurm_node_cpus_idle)      Total idle CPUs
slurm_cpus_total                              sum(slurm_node_cpus)           Total CPUs
slurm_nodes_alloc                             slurm_nodes_alloc              Allocated nodes (same name)
slurm_nodes_idle                              slurm_nodes_idle               Idle nodes (same name)
slurm_node_cpu_alloc                          slurm_node_cpus_alloc          Per-node CPU
slurm_node_mem_alloc                          slurm_node_memory_alloc_bytes  Per-node memory
slurm_scheduler_mean_cycle                    slurm_sched_mean_cycle         Scheduler cycle time
slurm_scheduler_backfilled_jobs_since_start   slurm_backfilled_jobs          Backfilled jobs count
slurm_account_fairshare                       Not available in native

In both implementations, per-node metrics (slurm_node_*) carry a {node="..."} label; the legacy exporter's fairshare metrics carried {account="...", user="..."} labels.

Some metrics differ only in name, while others are structured entirely differently. In particular, per-partition and per-user metrics are far richer in the native implementation. Fairshare metrics, however, are not yet included in the native endpoint.
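For the simple renames, migrating dashboard queries can be partly automated. A Python sketch (the mapping comes from the table above; the migrate_query helper is my own, and rewrites that introduce sum(...) may still need manual review if the name already sits inside an aggregation):

```python
import re

# Legacy exporter name -> 25.11 native equivalent, from the table above.
LEGACY_TO_NATIVE = {
    "slurm_cpus_alloc": "slurm_jobs_cpus_alloc",
    "slurm_cpus_idle": "sum(slurm_node_cpus_idle)",
    "slurm_cpus_total": "sum(slurm_node_cpus)",
    "slurm_node_cpu_alloc": "slurm_node_cpus_alloc",
    "slurm_node_mem_alloc": "slurm_node_memory_alloc_bytes",
    "slurm_scheduler_mean_cycle": "slurm_sched_mean_cycle",
    "slurm_scheduler_backfilled_jobs_since_start": "slurm_backfilled_jobs",
}

def migrate_query(query: str) -> str:
    """Rewrite legacy metric names in a PromQL string to native names."""
    # Longest names first, so a short name never clobbers a longer one.
    for old, new in sorted(LEGACY_TO_NATIVE.items(), key=lambda kv: -len(kv[0])):
        # \b works here because metric names are single word-character runs.
        query = re.sub(rf"\b{re.escape(old)}\b", new, query)
    return query

print(migrate_query("sum(slurm_cpus_alloc)"))
# → sum(slurm_jobs_cpus_alloc)
```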

Grafana Dashboard

All existing Slurm dashboards on the Grafana Labs marketplace are based on the legacy exporter.

I built a dashboard based on the 25.11 native metrics and published it to the marketplace.

Dashboard Layout

The dashboard is organized into five sections:

1. Cluster Summary — Stat panels showing running/pending jobs, CPU/memory utilization, and node state at a glance.

Key queries:

# CPU utilization
sum(slurm_node_cpus_alloc) / sum(slurm_node_cpus)

# Memory utilization
sum(slurm_node_memory_alloc_bytes) / sum(slurm_node_memory_bytes)

2. Job Trends — Timeseries panels showing job state changes over time and completion/start/submission rates.

# Job completion rate per minute
rate(slurm_sdiag_jobs_completed[5m]) * 60

3. Per-node Resources — Tracks CPU/memory allocation and utilization for each node. Useful for identifying nodes under concentrated load.

# Per-node CPU utilization
slurm_node_cpus_alloc / slurm_node_cpus

4. Per-user Jobs — Stacked timeseries showing how much resources each user is consuming. Useful for detecting resource concentration.

# Show only non-zero values (reduce noise)
slurm_user_jobs_running > 0
slurm_user_jobs_cpus_alloc > 0

5. Scheduler Performance — Internal performance metrics for slurmctld. Abnormally long scheduler cycle times can indicate throughput issues.

# Scheduler cycle time (µs)
slurm_sched_mean_cycle
slurm_bf_mean_cycle

# slurmdbd queue size (growing = problem)
slurm_slurmdbd_queue_size

Slurm Native OpenMetrics Dashboard

The dashboard can be imported from the Grafana marketplace.

Summary

Item                 Legacy (Exporter)              25.11 (Native)
Extra binary         Required                       Not required
Data collection      CLI parsing (squeue, sinfo)    Direct internal data from slurmctld
Prometheus config    Required                       Required (same)
Metric accuracy      Depends on CLI poll interval   Real-time internal state
Metric coverage      Limited                        Rich (per-partition, per-user, detailed scheduler)
Maintenance          Manage exporter separately     Included in Slurm package
Grafana dashboard    Many on marketplace            Listed on marketplace (#24979)

Slurm 25.11’s native OpenMetrics simplifies the architecture while providing richer metrics. The biggest advantages are eliminating the need to manage a separate exporter and being able to observe slurmctld’s internal state directly.

That said, this is still an early-stage feature — there is no authentication yet, and some metrics like fairshare are not yet included. I look forward to seeing these gaps addressed in future releases.


This post was written in a Slurm 25.11.2 + Prometheus 2.51 + Grafana 10.4 environment.

김종록

Currently working as a cloud engineer at Samsung SDS.

Previous