---
title: Slurm
description: Monitor Slurm cluster resource usage, job statuses, and system performance.
---

# Slurm

Integration version: 2.5.0

{% callout %}
# Important note for users on the following Datadog sites: us2.ddog-gov.com

{% alert level="info" %}
To find out if this integration is available in your organization, see your [Datadog Integrations](https://app.datadoghq.com/integrations) page or ask your organization administrator.

To initiate an exception request to enable this integration for your organization, email [support@ddog-gov.com](mailto:support@ddog-gov.com).
{% /alert %}

{% /callout %}

## Overview{% #overview %}

This check monitors [Slurm](https://slurm.schedmd.com/overview.html) through the Datadog Agent.

Slurm (Simple Linux Utility for Resource Management) is an open-source workload manager used to schedule and manage jobs on large-scale compute clusters. It allocates resources, monitors job queues, and ensures efficient execution of parallel and batch jobs in high-performance computing environments.

The check collects metrics from the head node (`slurmctld`) by executing and parsing the output of several command-line binaries: [`sinfo`](https://slurm.schedmd.com/sinfo.html), [`squeue`](https://slurm.schedmd.com/squeue.html), [`sacct`](https://slurm.schedmd.com/sacct.html), [`sdiag`](https://slurm.schedmd.com/sdiag.html), and [`sshare`](https://slurm.schedmd.com/sshare.html). These commands provide detailed information about resource availability, job queues, accounting, diagnostics, and share usage in a Slurm-managed cluster.

On worker nodes, the check can also collect metrics using [`scontrol`](https://slurm.schedmd.com/scontrol.html), which provides process IDs (PIDs) and other job information that is not available through the head node.

**Minimum Agent version:** 7.59.0

## Setup{% #setup %}

Follow the instructions below to install and configure this check for an Agent running on a host. Since the Agent requires direct access to the various Slurm binaries, monitoring Slurm in containerized environments is not recommended.
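Because the Agent invokes the Slurm binaries directly, it can help to confirm up front that each one is present and executable. The following is a minimal sketch, not part of the integration: `check_slurm_binaries` is a hypothetical helper, and `/usr/bin` appears only because it is the documented default for `slurm_binaries_dir`.

```shell
# Hypothetical helper: report which of the named binaries in DIR are
# executable by the user running this function.
# Usage: check_slurm_binaries DIR BINARY...
check_slurm_binaries() {
    dir="$1"
    shift
    rc=0
    for bin in "$@"; do
        if [ -x "$dir/$bin" ]; then
            echo "ok: $dir/$bin"
        else
            echo "missing or not executable: $dir/$bin"
            rc=1
        fi
    done
    # Non-zero exit status if any binary failed the check.
    return "$rc"
}

# Head node example (commented out so the snippet has no side effects):
# check_slurm_binaries /usr/bin sinfo squeue sacct sdiag sshare
```

The result reflects the permissions of whoever runs the function, so it is most meaningful when run as the `dd-agent` user. On a worker node, the list of binaries would typically be just `scontrol`.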

**Note**: This check was tested on Slurm version 21.08.0.

### Installation{% #installation %}

The Slurm check is included in the [Datadog Agent](https://app.datadoghq.com/account/settings/agent/latest) package. No additional installation is needed on your server.

### Configuration{% #configuration %}

#### Head Node{% #head-node %}

1. Ensure that the `dd-agent` user has execute permission on the relevant Slurm binaries and can access the directories where they are located.

1. Edit the `slurm.d/conf.yaml` file in the `conf.d/` folder at the root of your Agent's configuration directory to start collecting your Slurm data. See the [sample slurm.d/conf.yaml](https://github.com/DataDog/integrations-core/blob/master/slurm/datadog_checks/slurm/data/conf.yaml.example) for all available configuration options.

```yaml
init_config:

    ## Customize this part if the binaries are not located in the /usr/bin/ directory
    ## @param slurm_binaries_dir - string - optional - default: /usr/bin/
    ## The directory in which all the Slurm binaries are located. These are mainly:
    ## sinfo, squeue, sacct, sdiag, and sshare.

    slurm_binaries_dir: /usr/bin/

instances:

  -
    ## Configure these parameters to select which data the integration collects.
    ## @param collect_sinfo_stats - boolean - optional - default: true
    ## Whether or not to collect statistics from the sinfo command.
    #
    collect_sinfo_stats: true

    ## @param collect_sdiag_stats - boolean - optional - default: true
    ## Whether or not to collect statistics from the sdiag command.
    #
    collect_sdiag_stats: true

    ## @param collect_squeue_stats - boolean - optional - default: true
    ## Whether or not to collect statistics from the squeue command.
    #
    collect_squeue_stats: true

    ## @param collect_sacct_stats - boolean - optional - default: true
    ## Whether or not to collect statistics from the sacct command.
    #
    collect_sacct_stats: true

    ## @param collect_sshare_stats - boolean - optional - default: true
    ## Whether or not to collect statistics from the sshare command.
    #
    collect_sshare_stats: true

    ## @param collect_gpu_stats - boolean - optional - default: false
    ## Whether or not to collect GPU statistics when Slurm is configured to use GPUs using sinfo.
    #
    collect_gpu_stats: true

    ## @param sinfo_collection_level - integer - optional - default: 1
    ## The level of detail to collect from the sinfo command. The default is 1. Available options are 1, 2, and
    ## 3. Level 1 collects data only for partitions. Level 2 collects data from individual nodes. Level 3
    ## also collects data from individual nodes but is more verbose and includes data such as CPU and
    ## memory usage as reported by the OS, as well as additional tags.
    #
    sinfo_collection_level: 3

    ## @param collect_scontrol_stats - boolean - optional - default: false
    ## Whether or not to collect statistics from the scontrol command. This is mainly used in the worker 
    ## node to collect the list of running jobs along with their PIDs.
    collect_scontrol_stats: false # This should only be set on worker nodes and not the head node
```
[Restart the Agent](https://docs.datadoghq.com/agent/guide/agent-commands.md#start-stop-and-restart-the-agent).

#### Worker Nodes{% #worker-nodes %}

The `slurm.scontrol.job.info` metric can only be collected from worker nodes. This metric enables the submission of important tags that can be used to monitor the resource consumption of specific job steps.
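To get a feel for the job-to-PID mapping that `scontrol` exposes on a worker node, its `listpids` subcommand can be run by hand. The sketch below is illustrative only: `pids_per_job` is a hypothetical helper, the column order (`PID JOBID STEPID LOCALID GLOBALID`) is assumed from typical `scontrol listpids` output, and this page does not state which `scontrol` subcommand the check itself runs.

```shell
# Hypothetical helper: read `scontrol listpids`-style output on stdin and
# print "JOBID PID_COUNT" for each job, sorted by job ID.
# Assumes column 1 is the PID and column 2 is the job ID.
pids_per_job() {
    awk '$1 ~ /^[0-9]+$/ { count[$2]++ }   # skip the header row, count PIDs per job
         END { for (job in count) print job, count[job] }' | sort -n
}

# Example, on a worker node with Slurm installed (commented out here):
# scontrol listpids | pids_per_job
```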

1. Ensure that the `dd-agent` user has execute permission on the `scontrol` binary and can access the directory where it is located.

1. Edit the `slurm.d/conf.yaml` file in the `conf.d/` folder at the root of your Agent's configuration directory to start collecting your Slurm data. See the [sample slurm.d/conf.yaml](https://github.com/DataDog/integrations-core/blob/master/slurm/datadog_checks/slurm/data/conf.yaml.example) for all available configuration options.

```yaml
init_config:

    ## Customize this part if the binaries are not located in the /usr/bin/ directory
    ## @param slurm_binaries_dir - string - optional - default: /usr/bin/
    ## The directory in which all the Slurm binaries are located. These are mainly:
    ## sinfo, squeue, sacct, sdiag, and sshare.

    slurm_binaries_dir: /usr/bin/

instances:

  - 
    ## @param collect_scontrol_stats - boolean - optional - default: false
    ## Whether or not to collect statistics from the scontrol command. This is mainly used in the worker 
    ## node to collect the list of running jobs along with their PIDs.
    collect_scontrol_stats: true

    # The rest of these settings need to be turned off on the worker node because the information is specific
    # to the head node and isn't retrievable on the worker node.
    collect_sinfo_stats: false
    collect_sdiag_stats: false
    collect_squeue_stats: false
    collect_sacct_stats: false
    collect_sshare_stats: false
    collect_gpu_stats: false
    sinfo_collection_level: 1
```
[Restart the Agent](https://docs.datadoghq.com/agent/guide/agent-commands.md#start-stop-and-restart-the-agent).

### Validation{% #validation %}

[Run the Agent's status subcommand](https://docs.datadoghq.com/agent/guide/agent-commands.md#agent-status-and-information) and look for `slurm` under the Checks section.
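As a rough sketch of that validation step, the status output can be filtered down to the lines around the Slurm check. `slurm_check_status` is a hypothetical helper, and the eight lines of context are an arbitrary choice rather than a property of the Agent's output.

```shell
# Hypothetical helper: print lines mentioning "slurm" (case-insensitive),
# plus a few lines of context, from Agent status text read on stdin.
slurm_check_status() {
    grep -i -A 8 'slurm'
}

# Example usage on a host running the Agent (commented out here):
# sudo datadog-agent status | slurm_check_status
#
# To run the check once and print what it collected:
# sudo -u dd-agent -- datadog-agent check slurm
```

If the helper prints nothing and exits with a non-zero status, the status output did not mention the check at all, which usually points to a missing or misplaced `slurm.d/conf.yaml`.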

## Data Collected{% #data-collected %}

### Metrics{% #metrics %}

| Metric | Description |
| --- | --- |
| **slurm.node.alloc\_mem** (gauge) | Number of megabytes allocated on the node. *Shown as megabyte* |
| **slurm.node.cpu.allocated** (gauge) | Number of CPUs allocated on the node for job-related tasks. *Shown as cpu* |
| **slurm.node.cpu.idle** (gauge) | Number of idle CPUs on the node. *Shown as cpu* |
| **slurm.node.cpu.other** (gauge) | Number of CPUs performing other or non-job-related tasks on the node. *Shown as cpu* |
| **slurm.node.cpu.total** (gauge) | Total number of CPUs on the node. *Shown as cpu* |
| **slurm.node.cpu\_load** (gauge) | CPU load on the node as reported by the OS. |
| **slurm.node.free\_mem** (gauge) | Free memory on the node as reported by the OS. *Shown as megabyte* |
| **slurm.node.gpu\_total** (gauge) | Total number of GPUs on the node. |
| **slurm.node.gpu\_used** (gauge) | Number of GPUs used on the node. |
| **slurm.node.info** (gauge) | Information about the Slurm node. |
| **slurm.node.memory** (gauge) | Total memory on the node as reported by the OS. *Shown as megabyte* |
| **slurm.node.tmp\_disk** (gauge) | Temporary disk space on the node as reported by the OS. *Shown as megabyte* |
| **slurm.partition.cpu.allocated** (gauge) | (Deprecated) Number of CPUs allocated on the partition for job-related tasks. *Shown as cpu* |
| **slurm.partition.cpu.idle** (gauge) | (Deprecated) Number of idle CPUs on the partition. *Shown as cpu* |
| **slurm.partition.cpu.other** (gauge) | (Deprecated) Number of CPUs performing other or non-job-related tasks on the partition. *Shown as cpu* |
| **slurm.partition.cpu.total** (gauge) | (Deprecated) Total number of CPUs on the partition. *Shown as cpu* |
| **slurm.partition.gpu\_total** (gauge) | Total number of GPUs on the partition. |
| **slurm.partition.gpu\_used** (gauge) | Number of GPUs used on the partition. |
| **slurm.partition.info** (gauge) | Information about the Slurm partition. |
| **slurm.partition.node.allocated** (gauge) | Number of nodes allocated on the partition for job-related tasks. *Shown as node* |
| **slurm.partition.node.idle** (gauge) | Number of idle nodes on the partition. *Shown as node* |
| **slurm.partition.node.other** (gauge) | Number of nodes performing other or non-job-related tasks on the partition. *Shown as node* |
| **slurm.partition.node.total** (gauge) | Total number of nodes on the partition. *Shown as node* |
| **slurm.partition.nodes.count** (gauge) | Number of nodes in the partition. *Shown as node* |
| **slurm.sacct.enabled** (gauge) | Shows whether we're collecting sacct metrics or not for this host. |
| **slurm.sacct.job.duration** (gauge) | Duration of the job in seconds. *Shown as second* |
| **slurm.sacct.job.info** (gauge) | Information about the Slurm job in sacct. |
| **slurm.sacct.slurm\_job\_ave\_disk\_read** (gauge) | Average number of bytes read from disk by a job or step. *Shown as byte* |
| **slurm.sacct.slurm\_job\_avgcpu** (gauge) | Average (system + user) CPU time of all tasks in job. *Shown as second* |
| **slurm.sacct.slurm\_job\_avgrss** (gauge) | Average resident set size of all tasks in job. *Shown as byte* |
| **slurm.sacct.slurm\_job\_cputime** (gauge) | Time used (Elapsed time * CPU count) by a job or step in cpu-seconds. *Shown as second* |
| **slurm.sacct.slurm\_job\_max\_disk\_read** (gauge) | Maximum number of bytes read from disk by a job or step. *Shown as byte* |
| **slurm.sacct.slurm\_job\_maxrss** (gauge) | Maximum resident set size of all tasks in job. *Shown as byte* |
| **slurm.sacct.slurm\_job\_maxvm** (gauge) | Maximum virtual memory size of all tasks in job. *Shown as byte* |
| **slurm.scontrol.job.info** (gauge) | Status of running jobs on worker node. *Shown as job* |
| **slurm.sdiag.agent\_count** (gauge) | Number of agent threads. *Shown as thread* |
| **slurm.sdiag.agent\_queue\_size** (gauge) | Number of enqueued outgoing RPC requests in an internal retry list. *Shown as request* |
| **slurm.sdiag.agent\_thread\_count** (gauge) | Total count of active threads created by all the agent threads. *Shown as thread* |
| **slurm.sdiag.backfill.depth\_mean** (gauge) | Mean count of jobs processed during all backfilling scheduling cycles since last reset. *Shown as job* |
| **slurm.sdiag.backfill.depth\_mean\_try\_depth** (gauge) | The subset of Depth Mean that the backfill scheduler attempted to schedule. *Shown as job* |
| **slurm.sdiag.backfill.last\_cycle** (gauge) | Time in microseconds of last backfill scheduling cycle. *Shown as microsecond* |
| **slurm.sdiag.backfill.last\_cycle\_seconds\_ago** (gauge) | Time in seconds since the last scheduling cycle. *Shown as second* |
| **slurm.sdiag.backfill.last\_depth\_cycle** (gauge) | Number of processed jobs during last backfilling scheduling cycle. *Shown as job* |
| **slurm.sdiag.backfill.last\_depth\_try\_schedule** (gauge) | Number of processed jobs during last backfilling scheduling cycle, counting only jobs that had a chance to start using available resources. *Shown as job* |
| **slurm.sdiag.backfill.last\_queue\_length** (gauge) | Number of jobs pending to be processed by backfilling algorithm. *Shown as job* |
| **slurm.sdiag.backfill.last\_table\_size** (gauge) | Number of different time slots tested by the backfill scheduler in its last iteration. |
| **slurm.sdiag.backfill.max\_cycle** (gauge) | Time in microseconds of maximum backfill scheduling cycle execution since last reset. *Shown as microsecond* |
| **slurm.sdiag.backfill.mean\_cycle** (gauge) | Mean time in microseconds of backfilling scheduling cycles since last reset. *Shown as microsecond* |
| **slurm.sdiag.backfill.mean\_table\_size** (gauge) | Mean count of different time slots tested by the backfill scheduler. |
| **slurm.sdiag.backfill.queue\_length\_mean** (gauge) | Mean count of jobs pending to be processed by backfilling algorithm. *Shown as job* |
| **slurm.sdiag.backfill.total\_cycles** (gauge) | Number of backfill scheduling cycles since last reset. |
| **slurm.sdiag.backfill.total\_heterogeneous\_components** (gauge) | Number of heterogeneous job components started thanks to backfilling since last Slurm start. |
| **slurm.sdiag.backfill.total\_jobs\_since\_cycle\_start** (gauge) | Total backfilled jobs since last stats cycle restart. *Shown as job* |
| **slurm.sdiag.backfill.total\_jobs\_since\_start** (gauge) | Total backfilled jobs since last Slurm restart. *Shown as job* |
| **slurm.sdiag.cycles\_per\_minute** (gauge) | Scheduling executions per minute. |
| **slurm.sdiag.dbd\_agent\_queue\_size** (gauge) | DBD Agent message queue size for SlurmDBD. *Shown as message* |
| **slurm.sdiag.enabled** (gauge) | Shows whether we're collecting sdiag metrics or not for this host. |
| **slurm.sdiag.jobs\_canceled** (gauge) | Number of jobs canceled since last reset. *Shown as job* |
| **slurm.sdiag.jobs\_completed** (gauge) | Number of jobs completed since last reset. *Shown as job* |
| **slurm.sdiag.jobs\_failed** (gauge) | Number of jobs failed since last reset. *Shown as job* |
| **slurm.sdiag.jobs\_pending** (gauge) | Number of jobs pending since last reset. *Shown as job* |
| **slurm.sdiag.jobs\_running** (gauge) | Number of jobs running since last reset. *Shown as job* |
| **slurm.sdiag.jobs\_started** (gauge) | Number of jobs started since last reset. *Shown as job* |
| **slurm.sdiag.jobs\_submitted** (gauge) | Number of jobs submitted since last reset. *Shown as job* |
| **slurm.sdiag.last\_cycle** (gauge) | Time in microseconds for last scheduling cycle. *Shown as microsecond* |
| **slurm.sdiag.last\_queue\_length** (gauge) | Length of jobs pending queue. *Shown as job* |
| **slurm.sdiag.max\_cycle** (gauge) | Maximum time in microseconds for any scheduling cycle since last reset. *Shown as microsecond* |
| **slurm.sdiag.mean\_cycle** (gauge) | Mean time in microseconds for all scheduling cycles since last reset. *Shown as microsecond* |
| **slurm.sdiag.mean\_depth\_cycle** (gauge) | Mean of cycle depth. Depth means number of jobs processed in a scheduling cycle. *Shown as job* |
| **slurm.sdiag.server\_thread\_count** (gauge) | The number of current active slurmctld threads. *Shown as thread* |
| **slurm.sdiag.total\_cycles** (gauge) | The total run time in microseconds for all scheduling cycles since the last reset. *Shown as microsecond* |
| **slurm.seff.cpu\_efficiency** (gauge) | The CPU efficiency of the job. *Shown as percent* |
| **slurm.seff.cpu\_utilized** (gauge) | The CPU utilized by the job. *Shown as second* |
| **slurm.seff.memory\_efficiency** (gauge) | The memory efficiency of the job. *Shown as percent* |
| **slurm.seff.memory\_utilized\_mb** (gauge) | The memory utilized by the job. *Shown as megabyte* |
| **slurm.share.effective\_usage** (gauge) | The association's usage normalized with its parent. |
| **slurm.share.fair\_share** (gauge) | The Fair-Share factor, based on a user or account's assigned shares and the effective usage charged to them or their accounts. |
| **slurm.share.level\_fs** (gauge) | The association's fairshare value compared to its siblings, calculated as `norm_shares / effective_usage`. |
| **slurm.share.norm\_shares** (gauge) | The shares assigned to the user or account normalized to the total number of assigned shares. |
| **slurm.share.norm\_usage** (gauge) | The Raw Usage normalized to the total number of tres-seconds of all jobs run on the cluster. |
| **slurm.share.raw\_shares** (gauge) | The raw shares assigned to the user or account. |
| **slurm.share.raw\_usage** (gauge) | The number of tres-seconds (cpu-seconds if TRESBillingWeights is not defined) of all the jobs charged to the account or user. |
| **slurm.sinfo.node.enabled** (gauge) | Shows whether we're collecting node metrics or not for this host. |
| **slurm.sinfo.partition.enabled** (gauge) | Shows whether we're collecting partition metrics or not for this host. |
| **slurm.sinfo.squeue.enabled** (gauge) | Shows whether we're collecting squeue metrics or not for this host. |
| **slurm.squeue.job.info** (gauge) | Information about the Slurm job in squeue. |
| **slurm.sshare.enabled** (gauge) | Shows whether we're collecting sshare metrics or not for this host. |
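
The allocated, idle, other, and total CPU gauges have the same shape as the `allocated/idle/other/total` breakdown that `sinfo` prints for its `%C` format field. As a minimal sketch for spot-checking dashboard values against the command line, the hypothetical `cpu_allocation_pct` helper below turns that breakdown into an allocation percentage. Whether the check derives its metrics from `%C` specifically is an assumption, not something this page states.

```shell
# Hypothetical helper: read "NAME A/I/O/T" lines on stdin and print the
# percentage of CPUs that are allocated for each NAME.
cpu_allocation_pct() {
    awk '{
        split($2, c, "/")    # c[1]=allocated, c[2]=idle, c[3]=other, c[4]=total
        pct = (c[4] > 0) ? 100 * c[1] / c[4] : 0
        printf "%s %.1f\n", $1, pct
    }'
}

# Example, on a host with Slurm installed (commented out here):
# sinfo -h -o "%P %C" | cpu_allocation_pct     # per partition
# sinfo -N -h -o "%N %C" | cpu_allocation_pct  # per node
```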

### Events{% #events %}

The Slurm integration does not include any events.

## Troubleshooting{% #troubleshooting %}

Need help? Contact [Datadog support](https://docs.datadoghq.com/help/).