Slurm

Supported OS Linux

통합 버전2.1.0

이 페이지는 아직 영어로 제공되지 않습니다. 번역 작업 중입니다.
현재 번역 프로젝트에 대한 질문이나 피드백이 있으신 경우 언제든지 연락주시기 바랍니다.

Overview

This check monitors Slurm through the Datadog Agent.

Slurm (Simple Linux Utility for Resource Management) is an open-source workload manager used to schedule and manage jobs on large-scale compute clusters. It allocates resources, monitors job queues, and ensures efficient execution of parallel and batch jobs in high-performance computing environments.

The check collects metrics from the head node (slurmctld) by executing and parsing the output of several command-line binaries: sinfo, squeue, sacct, sdiag, and sshare. These commands provide detailed information about resource availability, job queues, accounting, diagnostics, and share usage in a Slurm-managed cluster.

On worker nodes, the check can also collect metrics using scontrol, which provides process IDs (PIDs) and other job information that is not available through the head node.

Minimum Agent version: 7.59.0

Setup

Follow the instructions below to install and configure this check for an Agent running on a host. Since the Agent requires direct access to the various Slurm binaries, monitoring Slurm in containerized environments is not recommended.

Note: This check was tested on Slurm version 21.08.0.

Installation

The Slurm check is included in the Datadog Agent package. No additional installation is needed on your server.

Configuration

Head Node

Ensure that the dd-agent user has execute permissions on the relevant command binaries and the necessary permissions to access the directories where these binaries are located.
Edit the slurm.d/conf.yaml file, in the conf.d/ folder at the root of your Agent’s configuration directory to start collecting your Slurm data. See the sample slurm.d/conf.yaml for all available configuration options.

init_config:

    ## Customize this part if the binaries are not located in the /usr/bin/ directory
    ## @param slurm_binaries_dir - string - optional - default: /usr/bin/
    ## The directory in which all the Slurm binaries are located. These are mainly:
    ## sinfo, sacct, sdiag, sshare and sdiag.

    slurm_binaries_dir: /usr/bin/

instances:

  -
    ## Configure these parameters to select which data the integration collects.
    ## @param collect_sinfo_stats - boolean - optional - default: true
    ## Whether or not to collect statistics from the sinfo command.
    #
    collect_sinfo_stats: true

    ## @param collect_sdiag_stats - boolean - optional - default: true
    ## Whether or not to collect statistics from the sdiag command.
    #
    collect_sdiag_stats: true

    ## @param collect_squeue_stats - boolean - optional - default: true
    ## Whether or not to collect statistics from the squeue command.
    #
    collect_squeue_stats: true

    ## @param collect_sacct_stats - boolean - optional - default: true
    ## Whether or not to collect statistics from the sacct command.
    #
    collect_sacct_stats: true

    ## @param collect_sshare_stats - boolean - optional - default: true
    ## Whether or not to collect statistics from the sshare command.
    #
    collect_sshare_stats: true

    ## @param collect_gpu_stats - boolean - optional - default: false
    ## Whether or not to collect GPU statistics when Slurm is configured to use GPUs using sinfo.
    #
    collect_gpu_stats: true

    ## @param sinfo_collection_level - integer - optional - default: 1
    ## The level of detail to collect from the sinfo command. The default is 'basic'. Available options are 1, 2 and
    ## 3. Level 1 collects data only for partitions. Level 2 collects data from individual nodes. Level 3 
    ## collects data from from individual nodes as well but is more verbose and includes data such as CPU and 
    ## memory usage as reported from the OS, as well as additional tags.
    #
    sinfo_collection_level: 3

    ## @param collect_scontrol_stats - boolean - optional - default: false
    ## Whether or not to collect statistics from the scontrol command. This is mainly used in the worker 
    ## node to collect the list of running jobs along with their PIDs.
    collect_scontrol_stats: false # This should only be set on worker nodes and not the head node

Restart the Agent.

Worker Nodes

The slurm.scontrol.job.info metric can only be collected from worker nodes. This metric enables the submission of important tags that can be used to monitor the resource consumption of specific job steps.

Ensure that the dd-agent user has execute permissions on the relevant scontrol binaries and the necessary permissions to access the directories where these binaries are located.
Edit the slurm.d/conf.yaml file, in the conf.d/ folder at the root of your Agent’s configuration directory to start collecting your Slurm data. See the sample slurm.d/conf.yaml for all available configuration options.

init_config:

    ## Customize this part if the binaries are not located in the /usr/bin/ directory
    ## @param slurm_binaries_dir - string - optional - default: /usr/bin/
    ## The directory in which all the Slurm binaries are located. These are mainly:
    ## sinfo, sacct, sdiag, and sshare.

    slurm_binaries_dir: /usr/bin/

instances:

  - 
    ## @param collect_scontrol_stats - boolean - optional - default: false
    ## Whether or not to collect statistics from the scontrol command. This is mainly used in the worker 
    ## node to collect the list of running jobs along with their PIDs.
    collect_scontrol_stats: true

    # The rest of these settings need to be turned off on the worker node because the information is specific
    # to the head node and isn't retrievable on the worker node.
    collect_sinfo_stats: false
    collect_sdiag_stats: false
    collect_squeue_stats: false
    collect_sacct_stats: false
    collect_sshare_stats: false
    collect_gpu_stats: false
    sinfo_collection_level: 1

Restart the Agent.

Validation

Run the Agent’s status subcommand and look for slurm under the Checks section.

Data Collected

Metrics


slurm.node.alloc_mem (gauge)	Number of megabytes allocated on the node. Shown as megabyte
slurm.node.cpu.allocated (gauge)	Number of CPUs allocated on the node for job-related tasks. Shown as cpu
slurm.node.cpu.idle (gauge)	Number of idle CPUs on the node. Shown as cpu
slurm.node.cpu.other (gauge)	Number of CPUs performing other or non-job-related tasks on the node. Shown as cpu
slurm.node.cpu.total (gauge)	Total number of CPUs on the node. Shown as cpu
slurm.node.cpu_load (gauge)	CPU load on the node as reported by the OS.
slurm.node.free_mem (gauge)	Free memory on the node as reported by the OS. Shown as megabyte
slurm.node.gpu_total (gauge)	Total number of GPUs on the node.
slurm.node.gpu_used (gauge)	Number of GPUs used on the node.
slurm.node.info (gauge)	Information about the Slurm node.
slurm.node.memory (gauge)	Total memory on the node as reported by the OS. Shown as megabyte
slurm.node.tmp_disk (gauge)	Temporary disk space on the node as reported by the OS. Shown as megabyte
slurm.partition.cpu.allocated (gauge)	(Deprecated) Number of CPUs allocated on the partition for job-related tasks. Shown as cpu
slurm.partition.cpu.idle (gauge)	(Deprecated) Number of idle CPUs on the partition. Shown as cpu
slurm.partition.cpu.other (gauge)	(Deprecated) Number of CPUs performing other or non-job-related tasks on the partition. Shown as cpu
slurm.partition.cpu.total (gauge)	(Deprecated) Total number of CPUs on the partition. Shown as cpu
slurm.partition.gpu_total (gauge)	Total number of GPUs on the partition.
slurm.partition.gpu_used (gauge)	Number of GPUs used on the partition.
slurm.partition.info (gauge)	Information about the Slurm partition.
slurm.partition.node.allocated (gauge)	Number of nodes allocated on the partition for job-related tasks. Shown as node
slurm.partition.node.idle (gauge)	Number of idle nodes on the partition. Shown as node
slurm.partition.node.other (gauge)	Number of nodes performing other or non-job-related tasks on the partition. Shown as node
slurm.partition.node.total (gauge)	Total number of nodes on the partition. Shown as node
slurm.partition.nodes.count (gauge)	Number of nodes in the partition. Shown as node
slurm.sacct.enabled (gauge)	Shows whether we’re collecting sacct metrics or not for this host.
slurm.sacct.job.duration (gauge)	Duration of the job in seconds. Shown as second
slurm.sacct.job.info (gauge)	Information about the Slurm job in sacct.
slurm.sacct.slurm_job_ave_disk_read (gauge)	Average number of bytes read from disk by a job or step. Shown as byte
slurm.sacct.slurm_job_avgcpu (gauge)	Average (system + user) CPU time of all tasks in job. Shown as second
slurm.sacct.slurm_job_avgrss (gauge)	Average resident set size of all tasks in job. Shown as byte
slurm.sacct.slurm_job_cputime (gauge)	Time used (Elapsed time * CPU count) by a job or step in cpu-seconds. Shown as second
slurm.sacct.slurm_job_max_disk_read (gauge)	Maximum number of bytes read from disk by a job or step. Shown as byte
slurm.sacct.slurm_job_maxrss (gauge)	Maximum resident set size of all tasks in job. Shown as byte
slurm.sacct.slurm_job_maxvm (gauge)	Maximum virtual memory size of all tasks in job. Shown as byte
slurm.scontrol.job.info (gauge)	Status of running jobs on worker node. Shown as job
slurm.sdiag.agent_count (gauge)	Number of agent threads. Shown as thread
slurm.sdiag.agent_queue_size (gauge)	Number of enqueued outgoing RPC requests in an internal retry list. Shown as request
slurm.sdiag.agent_thread_count (gauge)	Total count of active threads created by all the agent threads. Shown as thread
slurm.sdiag.backfill.depth_mean (gauge)	Mean count of jobs processed during all backfilling scheduling cycles since last reset. Shown as job
slurm.sdiag.backfill.depth_mean_try_depth (gauge)	The subset of Depth Mean that the backfill scheduler attempted to schedule. Shown as job
slurm.sdiag.backfill.last_cycle (gauge)	Time in microseconds of last backfill scheduling cycle. Shown as microsecond
slurm.sdiag.backfill.last_cycle_seconds_ago (gauge)	Time in seconds since the last scheduling cycle. Shown as second
slurm.sdiag.backfill.last_depth_cycle (gauge)	Number of processed jobs during last backfilling scheduling cycle. Shown as job
slurm.sdiag.backfill.last_depth_try_schedule (gauge)	Number of processed jobs during last backfilling scheduling cycle. Shown as job
slurm.sdiag.backfill.last_queue_length (gauge)	Number of jobs pending to be processed by backfilling algorithm. Shown as job
slurm.sdiag.backfill.last_table_size (gauge)	Number of different time slots tested by the backfill scheduler in its last iteration.
slurm.sdiag.backfill.max_cycle (gauge)	Time in microseconds of maximum backfill scheduling cycle execution since last reset. Shown as microsecond
slurm.sdiag.backfill.mean_cycle (gauge)	Mean time in microseconds of backfilling scheduling cycles since last reset. Shown as microsecond
slurm.sdiag.backfill.mean_table_size (gauge)	Mean count of different time slots tested by the backfill scheduler.
slurm.sdiag.backfill.queue_length_mean (gauge)	Mean count of jobs pending to be processed by backfilling algorithm. Shown as job
slurm.sdiag.backfill.total_cycles (gauge)	Number of backfill scheduling cycles since last reset.
slurm.sdiag.backfill.total_heterogeneous_components (gauge)	Number of heterogeneous job components started thanks to backfilling since last Slurm start.
slurm.sdiag.backfill.total_jobs_since_cycle_start (gauge)	Total backfilled jobs since last stats cycle restart. Shown as job
slurm.sdiag.backfill.total_jobs_since_start (gauge)	Total backfilled jobs since last slurm restart. Shown as job
slurm.sdiag.cycles_per_minute (gauge)	Scheduling executions per minute.
slurm.sdiag.dbd_agent_queue_size (gauge)	DBD Agent message queue size for SlurmDBD. Shown as message
slurm.sdiag.enabled (gauge)	Shows whether we’re collecting sdiag metrics or not for this host.
slurm.sdiag.jobs_canceled (gauge)	Number of jobs canceled since last reset. Shown as job
slurm.sdiag.jobs_completed (gauge)	Number of jobs completed since last reset. Shown as job
slurm.sdiag.jobs_failed (gauge)	Number of jobs failed since last reset. Shown as job
slurm.sdiag.jobs_pending (gauge)	Number of jobs pending since last reset. Shown as job
slurm.sdiag.jobs_running (gauge)	Number of jobs running since last reset. Shown as job
slurm.sdiag.jobs_started (gauge)	Number of jobs started since last reset. Shown as job
slurm.sdiag.jobs_submitted (gauge)	Number of jobs submitted since last reset. Shown as job
slurm.sdiag.last_cycle (gauge)	Time in microseconds for last scheduling cycle. Shown as microsecond
slurm.sdiag.last_queue_length (gauge)	Length of jobs pending queue. Shown as job
slurm.sdiag.max_cycle (gauge)	Maximum time in microseconds for any scheduling cycle since last reset. Shown as microsecond
slurm.sdiag.mean_cycle (gauge)	Mean time in microseconds for all scheduling cycles since last reset. Shown as microsecond
slurm.sdiag.mean_depth_cycle (gauge)	Mean of cycle depth. Depth means number of jobs processed in a scheduling cycle. Shown as job
slurm.sdiag.server_thread_count (gauge)	The number of current active slurmctld threads. Shown as thread
slurm.sdiag.total_cycles (gauge)	The total run time in microseconds for all scheduling cycles since the last reset. Shown as microsecond
slurm.seff.cpu_efficiency (gauge)	The CPU efficiency of the job. Shown as percent
slurm.seff.cpu_utilized (gauge)	The CPU utilized by the job. Shown as second
slurm.seff.memory_efficiency (gauge)	The memory efficiency of the job. Shown as percent
slurm.seff.memory_utilized_mb (gauge)	The memory utilized by the job. Shown as megabyte
slurm.share.effective_usage (gauge)	The association’s usage normalized with its parent.
slurm.share.fair_share (gauge)	The Fair-Share factor, based on a user or account’s assigned shares and the effective usage charged to them or their accounts.
slurm.share.level_fs (gauge)	This is the association’s fairshare value compared to its siblings,calculated as norm_shares / effective_usage.
slurm.share.norm_shares (gauge)	The shares assigned to the user or account normalized to the total number of assigned shares.
slurm.share.norm_usage (gauge)	The Raw Usage normalized to the total number of tres-seconds of all jobs run on the cluster.
slurm.share.raw_shares (gauge)	The raw shares assigned to the user or account.
slurm.share.raw_usage (gauge)	The number of tres-seconds (cpu-seconds if TRESBillingWeights is not defined) of all the jobs charged to the account or user.
slurm.sinfo.node.enabled (gauge)	Shows whether we’re collecting node metrics or not for this host.
slurm.sinfo.partition.enabled (gauge)	Shows whether we’re collecting partition metrics or not for this host.
slurm.sinfo.squeue.enabled (gauge)	Shows whether we’re collecting squeue metrics or not for this host.
slurm.squeue.job.info (gauge)	Information about the Slurm job in squeue.
slurm.sshare.enabled (gauge)	Shows whether we’re collecting sshare metrics or not for this host.

Events

The Slurm integration does not include any events.

Troubleshooting

Need help? Contact Datadog support.