This check monitors Slurm through the Datadog Agent.
Slurm (Simple Linux Utility for Resource Management) is an open-source workload manager used to schedule and manage jobs on large-scale compute clusters. It allocates resources, monitors job queues, and ensures efficient execution of parallel and batch jobs in high-performance computing environments.
The check gathers metrics from slurmctld by executing and parsing the output of several command-line binaries, including sinfo, squeue, sacct, sdiag, and sshare. These commands provide detailed information on resource availability, job queues, accounting, diagnostics, and share usage in a Slurm-managed cluster.
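To illustrate this approach, the following is a minimal, hypothetical sketch of how output from one of these binaries could be parsed, using sinfo's per-partition CPU state summary. It is not the actual check implementation; the binary path and format string are assumptions chosen for this example.

```python
import subprocess

# Hypothetical sketch: run sinfo and parse per-partition CPU states.
# "%P" prints the partition name and "%C" prints CPUs as allocated/idle/other/total.
SINFO = "/usr/bin/sinfo"  # assumed location; see slurm_binaries_dir in conf.yaml

def collect_partition_cpu_stats():
    output = subprocess.run(
        [SINFO, "--noheader", "-o", "%P %C"],
        capture_output=True, text=True, check=True,
    ).stdout

    stats = {}
    for line in output.splitlines():
        partition, cpu_states = line.split()
        allocated, idle, other, total = (int(x) for x in cpu_states.split("/"))
        # sinfo marks the default partition with a trailing "*"; strip it for tagging.
        stats[partition.rstrip("*")] = {
            "slurm.partition.cpu.allocated": allocated,
            "slurm.partition.cpu.idle": idle,
            "slurm.partition.cpu.other": other,
            "slurm.partition.cpu.total": total,
        }
    return stats

if __name__ == "__main__":
    for partition, gauges in collect_partition_cpu_stats().items():
        print(partition, gauges)
```

The check reports comparable values as the slurm.partition.cpu.* metrics listed below.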
Follow the instructions below to install and configure this check for an Agent running on a host. Since the Agent requires direct access to the various Slurm binaries, monitoring Slurm in containerized environments is not recommended.
Note: This check was tested on Slurm version 21.08.0.
The Slurm check is included in the Datadog Agent package. No additional installation is needed on your server.
Ensure that the dd-agent user has execute permissions on the relevant command binaries and the necessary permissions to access the directories where these binaries are located.
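One way to verify these permissions is to run each binary as the dd-agent user and confirm that it executes, as in the hypothetical sketch below. It assumes sudo is available and that the binaries live under /usr/bin/; adjust the path to match your installation.

```python
import subprocess

# Hypothetical permission check: invoke each Slurm binary as the dd-agent user.
# Assumes sudo is available and the binaries are installed under /usr/bin/.
BINARIES_DIR = "/usr/bin"
BINARIES = ["sinfo", "squeue", "sacct", "sdiag", "sshare"]

for name in BINARIES:
    path = f"{BINARIES_DIR}/{name}"
    result = subprocess.run(
        ["sudo", "-u", "dd-agent", path, "--version"],
        capture_output=True, text=True,
    )
    status = "OK" if result.returncode == 0 else f"FAILED (exit {result.returncode})"
    print(f"{path}: {status}")
```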
Edit the slurm.d/conf.yaml file in the conf.d/ folder at the root of your Agent’s configuration directory to start collecting your Slurm data. See the sample slurm.d/conf.yaml for all available configuration options.
```yaml
init_config:

    ## Customize this part if the binaries are not located in the /usr/bin/ directory
    ## @param slurm_binaries_dir - string - optional - default: /usr/bin/
    ## The directory in which all the Slurm binaries are located. These are mainly:
    ## sinfo, squeue, sacct, sdiag, and sshare.
    #
    slurm_binaries_dir: /usr/bin/

instances:

  -
    ## Configure these parameters to select which data the integration collects.

    ## @param collect_sinfo_stats - boolean - optional - default: true
    ## Whether or not to collect statistics from the sinfo command.
    #
    collect_sinfo_stats: true

    ## @param collect_sdiag_stats - boolean - optional - default: true
    ## Whether or not to collect statistics from the sdiag command.
    #
    collect_sdiag_stats: true

    ## @param collect_squeue_stats - boolean - optional - default: true
    ## Whether or not to collect statistics from the squeue command.
    #
    collect_squeue_stats: true

    ## @param collect_sacct_stats - boolean - optional - default: true
    ## Whether or not to collect statistics from the sacct command.
    #
    collect_sacct_stats: true

    ## @param collect_sshare_stats - boolean - optional - default: true
    ## Whether or not to collect statistics from the sshare command.
    #
    collect_sshare_stats: true

    ## @param collect_gpu_stats - boolean - optional - default: false
    ## Whether or not to collect GPU statistics when Slurm is configured to use GPUs using sinfo.
    #
    collect_gpu_stats: true

    ## @param sinfo_collection_level - integer - optional - default: 1
    ## The level of detail to collect from the sinfo command. The default is 1. Available options
    ## are 1, 2, and 3. Level 1 collects data only for partitions. Level 2 collects data for
    ## individual nodes. Level 3 also collects data for individual nodes but is more verbose and
    ## includes data such as CPU and memory usage as reported by the OS, as well as additional tags.
    #
    sinfo_collection_level: 1
```
Run the Agent’s status subcommand and look for slurm under the Checks section.
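If you want to script this validation, a small wrapper around the status subcommand can confirm that the check appears in the output. The sketch below assumes datadog-agent is on the PATH and that elevated privileges are required, which may differ on your system.

```python
import subprocess

# Hypothetical validation helper: run the Agent status subcommand and
# confirm that the slurm check appears in the output.
result = subprocess.run(
    ["sudo", "datadog-agent", "status"],
    capture_output=True, text=True,
)
if "slurm" in result.stdout:
    print("slurm check found in Agent status output")
else:
    print("slurm check not found; verify conf.d/slurm.d/conf.yaml and restart the Agent")
```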
| Metric | Description |
| --- | --- |
| slurm.node.cpu.allocated (gauge) | Number of CPUs allocated on the node for job-related tasks. Shown as cpu |
| slurm.node.cpu.idle (gauge) | Number of idle CPUs on the node. Shown as cpu |
| slurm.node.cpu.other (gauge) | Number of CPUs performing other or non-job-related tasks on the node. Shown as cpu |
| slurm.node.cpu.total (gauge) | Total number of CPUs on the node. Shown as cpu |
| slurm.node.cpu_load (gauge) | CPU load on the node as reported by the OS. |
| slurm.node.free_mem (gauge) | Free memory on the node as reported by the OS. Shown as megabyte |
| slurm.node.gpu_total (gauge) | Total number of GPUs on the node. |
| slurm.node.gpu_used (gauge) | Number of GPUs used on the node. |
| slurm.node.info (gauge) | Information about the Slurm node. |
| slurm.node.tmp_disk (gauge) | Temporary disk space on the node as reported by the OS. Shown as megabyte |
| slurm.partition.cpu.allocated (gauge) | Number of CPUs allocated on the partition for job-related tasks. Shown as cpu |
| slurm.partition.cpu.idle (gauge) | Number of idle CPUs on the partition. Shown as cpu |
| slurm.partition.cpu.other (gauge) | Number of CPUs performing other or non-job-related tasks on the partition. Shown as cpu |
| slurm.partition.cpu.total (gauge) | Total number of CPUs on the partition. Shown as cpu |
| slurm.partition.gpu_total (gauge) | Total number of GPUs on the partition. |
| slurm.partition.gpu_used (gauge) | Number of GPUs used on the partition. |
| slurm.partition.info (gauge) | Information about the Slurm partition. |
| slurm.partition.nodes.count (gauge) | Number of nodes in the partition. Shown as node |
| slurm.sacct.enabled (gauge) | Shows whether we're collecting sacct metrics or not for this host. |
| slurm.sacct.job.duration (gauge) | Duration of the job in seconds. Shown as second |
| slurm.sacct.job.info (gauge) | Information about the Slurm job in sacct. |
| slurm.sacct.slurm_job_avgcpu (gauge) | Average (system + user) CPU time of all tasks in job. Shown as second |
| slurm.sacct.slurm_job_avgrss (gauge) | Average resident set size of all tasks in job. |
| slurm.sacct.slurm_job_cputime (gauge) | Time used (Elapsed time * CPU count) by a job or step in cpu-seconds. Shown as second |
| slurm.sacct.slurm_job_maxrss (gauge) | Maximum resident set size of all tasks in job. |
| slurm.sdiag.agent_count (gauge) | Number of agent threads. Shown as thread |
| slurm.sdiag.agent_queue_size (gauge) | Number of enqueued outgoing RPC requests in an internal retry list. Shown as request |
| slurm.sdiag.agent_thread_count (gauge) | Total count of active threads created by all the agent threads. Shown as thread |
| slurm.sdiag.backfill.depth_mean (gauge) | Mean count of jobs processed during all backfilling scheduling cycles since last reset. Shown as job |
| slurm.sdiag.backfill.depth_mean_try_depth (gauge) | The subset of Depth Mean that the backfill scheduler attempted to schedule. Shown as job |
| slurm.sdiag.backfill.last_cycle (gauge) | Time in microseconds of last backfill scheduling cycle. Shown as microsecond |
| slurm.sdiag.backfill.last_depth_cycle (gauge) | Number of processed jobs during last backfilling scheduling cycle. Shown as job |
| slurm.sdiag.backfill.last_depth_try_schedule (gauge) | Number of processed jobs during last backfilling scheduling cycle. Shown as job |
| slurm.sdiag.backfill.last_queue_length (gauge) | Number of jobs pending to be processed by backfilling algorithm. Shown as job |
| slurm.sdiag.backfill.last_table_size (gauge) | Number of different time slots tested by the backfill scheduler in its last iteration. |
| slurm.sdiag.backfill.max_cycle (gauge) | Time in microseconds of maximum backfill scheduling cycle execution since last reset. Shown as microsecond |
| slurm.sdiag.backfill.mean_cycle (gauge) | Mean time in microseconds of backfilling scheduling cycles since last reset. Shown as microsecond |
| slurm.sdiag.backfill.mean_table_size (gauge) | Mean count of different time slots tested by the backfill scheduler. |
| slurm.sdiag.backfill.queue_length_mean (gauge) | Mean count of jobs pending to be processed by backfilling algorithm. Shown as job |
| slurm.sdiag.backfill.total_cycles (gauge) | Number of backfill scheduling cycles since last reset. |
| slurm.sdiag.backfill.total_heterogeneous_components (gauge) | Number of heterogeneous job components started thanks to backfilling since last Slurm start. |
| slurm.sdiag.backfill.total_jobs_since_cycle_start (gauge) | Total backfilled jobs since last stats cycle restart. Shown as job |
| slurm.sdiag.backfill.total_jobs_since_start (gauge) | Total backfilled jobs since last slurm restart. Shown as job |
| slurm.sdiag.cycles_per_minute (gauge) | Scheduling executions per minute. |
| slurm.sdiag.dbd_agent_queue_size (gauge) | DBD Agent message queue size for SlurmDBD. Shown as message |
| slurm.sdiag.enabled (gauge) | Shows whether we're collecting sdiag metrics or not for this host. |
| slurm.sdiag.jobs_canceled (gauge) | Number of jobs canceled since last reset. Shown as job |
| slurm.sdiag.jobs_completed (gauge) | Number of jobs completed since last reset. Shown as job |
| slurm.sdiag.jobs_failed (gauge) | Number of jobs failed since last reset. Shown as job |
| slurm.sdiag.jobs_pending (gauge) | Number of jobs pending since last reset. Shown as job |
| slurm.sdiag.jobs_running (gauge) | Number of jobs running since last reset. Shown as job |
| slurm.sdiag.jobs_started (gauge) | Number of jobs started since last reset. Shown as job |
| slurm.sdiag.jobs_submitted (gauge) | Number of jobs submitted since last reset. Shown as job |
| slurm.sdiag.last_cycle (gauge) | Time in microseconds for last scheduling cycle. Shown as microsecond |
| slurm.sdiag.last_queue_length (gauge) | Length of jobs pending queue. Shown as job |
| slurm.sdiag.max_cycle (gauge) | Maximum time in microseconds for any scheduling cycle since last reset. Shown as microsecond |
| slurm.sdiag.mean_cycle (gauge) | Mean time in microseconds for all scheduling cycles since last reset. Shown as microsecond |
| slurm.sdiag.mean_depth_cycle (gauge) | Mean of cycle depth. Depth means number of jobs processed in a scheduling cycle. Shown as job |
| slurm.sdiag.server_thread_count (gauge) | The number of current active slurmctld threads. Shown as thread |
| slurm.sdiag.total_cycles (gauge) | The total run time in microseconds for all scheduling cycles since the last reset. Shown as microsecond |
| slurm.share.effective_usage (gauge) | The association's usage normalized with its parent. |
| slurm.share.fair_share (gauge) | The Fair-Share factor, based on a user or account's assigned shares and the effective usage charged to them or their accounts. |
| slurm.share.level_fs (gauge) | The association's fairshare value compared to its siblings, calculated as norm_shares / effective_usage. |
| slurm.share.norm_shares (gauge) | The shares assigned to the user or account normalized to the total number of assigned shares. |
| slurm.share.norm_usage (gauge) | The Raw Usage normalized to the total number of tres-seconds of all jobs run on the cluster. |
| slurm.share.raw_shares (gauge) | The raw shares assigned to the user or account. |
| slurm.share.raw_usage (gauge) | The number of tres-seconds (cpu-seconds if TRESBillingWeights is not defined) of all the jobs charged to the account or user. |
| slurm.sinfo.node.enabled (gauge) | Shows whether we're collecting node metrics or not for this host. |
| slurm.sinfo.partition.enabled (gauge) | Shows whether we're collecting partition metrics or not for this host. |
| slurm.sinfo.squeue.enabled (gauge) | Shows whether we're collecting squeue metrics or not for this host. |
| slurm.squeue.job.info (gauge) | Information about the Slurm job in squeue. |
| slurm.sshare.enabled (gauge) | Shows whether we're collecting sshare metrics or not for this host. |
The Slurm integration does not include any events.
Need help? Contact Datadog support.