This check monitors Slurm through the Datadog Agent.
Slurm (Simple Linux Utility for Resource Management) is an open-source workload manager used to schedule and manage jobs on large-scale compute clusters. It allocates resources, monitors job queues, and ensures efficient execution of parallel and batch jobs in high-performance computing environments.
The check gathers metrics from slurmctld by executing and parsing the output of several command-line binaries, including sinfo, squeue, sacct, sdiag, and sshare. These commands provide detailed information on resource availability, job queues, accounting, diagnostics, and share usage in a Slurm-managed cluster.
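To illustrate this approach, the following is a minimal, hypothetical sketch of how output from one of these binaries could be parsed, using sinfo's per-partition CPU state summary. It is not the actual check implementation; the binary path and format string are assumptions chosen for this example.

```python
import subprocess

# Hypothetical sketch: run sinfo and parse per-partition CPU states.
# "%P" prints the partition name and "%C" prints CPUs as allocated/idle/other/total.
SINFO = "/usr/bin/sinfo"  # assumed location; see slurm_binaries_dir in conf.yaml

def collect_partition_cpu_stats():
    output = subprocess.run(
        [SINFO, "--noheader", "-o", "%P %C"],
        capture_output=True, text=True, check=True,
    ).stdout

    stats = {}
    for line in output.splitlines():
        partition, cpu_states = line.split()
        allocated, idle, other, total = (int(x) for x in cpu_states.split("/"))
        # sinfo marks the default partition with a trailing "*"; strip it for tagging.
        stats[partition.rstrip("*")] = {
            "slurm.partition.cpu.allocated": allocated,
            "slurm.partition.cpu.idle": idle,
            "slurm.partition.cpu.other": other,
            "slurm.partition.cpu.total": total,
        }
    return stats

if __name__ == "__main__":
    for partition, gauges in collect_partition_cpu_stats().items():
        print(partition, gauges)
```

The check reports comparable values as the slurm.partition.cpu.* metrics listed below.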
Follow the instructions below to install and configure this check for an Agent running on a host. Since the Agent requires direct access to the various Slurm binaries, monitoring Slurm in containerized environments is not recommended.
Note: This check was tested on Slurm version 21.08.0.
The Slurm check is included in the Datadog Agent package. No additional installation is needed on your server.
Ensure that the dd-agent user has execute permissions on the relevant command binaries and the necessary permissions to access the directories where these binaries are located.
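One way to verify these permissions is to run each binary as the dd-agent user and confirm that it executes, as in the hypothetical sketch below. It assumes sudo is available and that the binaries live under /usr/bin/; adjust the path to match your installation.

```python
import subprocess

# Hypothetical permission check: invoke each Slurm binary as the dd-agent user.
# Assumes sudo is available and the binaries are installed under /usr/bin/.
BINARIES_DIR = "/usr/bin"
BINARIES = ["sinfo", "squeue", "sacct", "sdiag", "sshare"]

for name in BINARIES:
    path = f"{BINARIES_DIR}/{name}"
    result = subprocess.run(
        ["sudo", "-u", "dd-agent", path, "--version"],
        capture_output=True, text=True,
    )
    status = "OK" if result.returncode == 0 else f"FAILED (exit {result.returncode})"
    print(f"{path}: {status}")
```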
Edit the slurm.d/conf.yaml file in the conf.d/ folder at the root of your Agent’s configuration directory to start collecting your Slurm data. See the sample slurm.d/conf.yaml for all available configuration options.
```yaml
init_config:

    ## Customize this part if the binaries are not located in the /usr/bin/ directory
    ## @param slurm_binaries_dir - string - optional - default: /usr/bin/
    ## The directory in which all the Slurm binaries are located. These are mainly:
    ## sinfo, squeue, sacct, sdiag, and sshare.
    #
    slurm_binaries_dir: /usr/bin/

instances:

  -
    ## Configure these parameters to select which data the integration collects.

    ## @param collect_sinfo_stats - boolean - optional - default: true
    ## Whether or not to collect statistics from the sinfo command.
    #
    collect_sinfo_stats: true

    ## @param collect_sdiag_stats - boolean - optional - default: true
    ## Whether or not to collect statistics from the sdiag command.
    #
    collect_sdiag_stats: true

    ## @param collect_squeue_stats - boolean - optional - default: true
    ## Whether or not to collect statistics from the squeue command.
    #
    collect_squeue_stats: true

    ## @param collect_sacct_stats - boolean - optional - default: true
    ## Whether or not to collect statistics from the sacct command.
    #
    collect_sacct_stats: true

    ## @param collect_sshare_stats - boolean - optional - default: true
    ## Whether or not to collect statistics from the sshare command.
    #
    collect_sshare_stats: true

    ## @param collect_gpu_stats - boolean - optional - default: false
    ## Whether or not to collect GPU statistics when Slurm is configured to use GPUs using sinfo.
    #
    collect_gpu_stats: true

    ## @param sinfo_collection_level - integer - optional - default: 1
    ## The level of detail to collect from the sinfo command. The default is 1. Available options
    ## are 1, 2, and 3. Level 1 collects data only for partitions. Level 2 collects data for
    ## individual nodes. Level 3 also collects data for individual nodes but is more verbose and
    ## includes data such as CPU and memory usage as reported by the OS, as well as additional tags.
    #
    sinfo_collection_level: 1
```
Run the Agent’s status subcommand and look for slurm under the Checks section.
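If you want to script this validation, a small wrapper around the status subcommand can confirm that the check appears in the output. The sketch below assumes datadog-agent is on the PATH and that elevated privileges are required, which may differ on your system.

```python
import subprocess

# Hypothetical validation helper: run the Agent status subcommand and
# confirm that the slurm check appears in the output.
result = subprocess.run(
    ["sudo", "datadog-agent", "status"],
    capture_output=True, text=True,
)
if "slurm" in result.stdout:
    print("slurm check found in Agent status output")
else:
    print("slurm check not found; verify conf.d/slurm.d/conf.yaml and restart the Agent")
```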
| Metric | Description |
| --- | --- |
| slurm.node.cpu.allocated (gauge) | Number of CPUs allocated on the node for job-related tasks. Shown as cpu |
| slurm.node.cpu.idle (gauge) | Number of idle CPUs on the node. Shown as cpu |
| slurm.node.cpu.other (gauge) | Number of CPUs performing other or non-job-related tasks on the node. Shown as cpu |
| slurm.node.cpu.total (gauge) | Total number of CPUs on the node. Shown as cpu |
| slurm.node.cpu_load (gauge) | CPU load on the node as reported by the OS. |
| slurm.node.free_mem (gauge) | Free memory on the node as reported by the OS. Shown as megabyte |
| slurm.node.gpu_total (gauge) | Total number of GPUs on the node. |
| slurm.node.gpu_used (gauge) | Number of GPUs used on the node. |
| slurm.node.info (gauge) | Information about the Slurm node. |
| slurm.node.tmp_disk (gauge) | Temporary disk space on the node as reported by the OS. Shown as megabyte |
| slurm.partition.cpu.allocated (gauge) | Number of CPUs allocated on the partition for job-related tasks. Shown as cpu |
| slurm.partition.cpu.idle (gauge) | Number of idle CPUs on the partition. Shown as cpu |
| slurm.partition.cpu.other (gauge) | Number of CPUs performing other or non-job-related tasks on the partition. Shown as cpu |
| slurm.partition.cpu.total (gauge) | Total number of CPUs on the partition. Shown as cpu |
| slurm.partition.gpu_total (gauge) | Total number of GPUs on the partition. |
| slurm.partition.gpu_used (gauge) | Number of GPUs used on the partition. |
| slurm.partition.info (gauge) | Information about the Slurm partition. |
| slurm.partition.nodes.count (gauge) | Number of nodes in the partition. Shown as node |
| slurm.sacct.enabled (gauge) | Shows whether we're collecting sacct metrics or not for this host. |
| slurm.sacct.job.duration (gauge) | Duration of the job in seconds. Shown as second |
| slurm.sacct.job.info (gauge) | Information about the Slurm job in sacct. |
| slurm.sacct.slurm_job_avgcpu (gauge) | Average (system + user) CPU time of all tasks in job. Shown as second |
| slurm.sacct.slurm_job_avgrss (gauge) | Average resident set size of all tasks in job. |
| slurm.sacct.slurm_job_cputime (gauge) | Time used (Elapsed time * CPU count) by a job or step in cpu-seconds. Shown as second |
| slurm.sacct.slurm_job_maxrss (gauge) | Maximum resident set size of all tasks in job. |
| slurm.sdiag.agent_count (gauge) | Number of agent threads. Shown as thread |
| slurm.sdiag.agent_queue_size (gauge) | Number of enqueued outgoing RPC requests in an internal retry list. Shown as request |
| slurm.sdiag.agent_thread_count (gauge) | Total count of active threads created by all the agent threads. Shown as thread |
| slurm.sdiag.backfill.depth_mean (gauge) | Mean count of jobs processed during all backfilling scheduling cycles since last reset. Shown as job |
| slurm.sdiag.backfill.depth_mean_try_depth (gauge) | The subset of Depth Mean that the backfill scheduler attempted to schedule. Shown as job |
| slurm.sdiag.backfill.last_cycle (gauge) | Time in microseconds of last backfill scheduling cycle. Shown as microsecond |
| slurm.sdiag.backfill.last_depth_cycle (gauge) | Number of processed jobs during last backfilling scheduling cycle. Shown as job |
| slurm.sdiag.backfill.last_depth_try_schedule (gauge) | Number of processed jobs during last backfilling scheduling cycle. Shown as job |
| slurm.sdiag.backfill.last_queue_length (gauge) | Number of jobs pending to be processed by backfilling algorithm. Shown as job |
| slurm.sdiag.backfill.last_table_size (gauge) | Number of different time slots tested by the backfill scheduler in its last iteration. |
| slurm.sdiag.backfill.max_cycle (gauge) | Time in microseconds of maximum backfill scheduling cycle execution since last reset. Shown as microsecond |
| slurm.sdiag.backfill.mean_cycle (gauge) | Mean time in microseconds of backfilling scheduling cycles since last reset. Shown as microsecond |
| slurm.sdiag.backfill.mean_table_size (gauge) | Mean count of different time slots tested by the backfill scheduler. |
| slurm.sdiag.backfill.queue_length_mean (gauge) | Mean count of jobs pending to be processed by backfilling algorithm. Shown as job |
| slurm.sdiag.backfill.total_cycles (gauge) | Number of backfill scheduling cycles since last reset. |
| slurm.sdiag.backfill.total_heterogeneous_components (gauge) | Number of heterogeneous job components started thanks to backfilling since last Slurm start. |
| slurm.sdiag.backfill.total_jobs_since_cycle_start (gauge) | Total backfilled jobs since last stats cycle restart. Shown as job |
| slurm.sdiag.backfill.total_jobs_since_start (gauge) | Total backfilled jobs since last slurm restart. Shown as job |
| slurm.sdiag.cycles_per_minute (gauge) | Scheduling executions per minute. |
| slurm.sdiag.dbd_agent_queue_size (gauge) | DBD Agent message queue size for SlurmDBD. Shown as message |
| slurm.sdiag.enabled (gauge) | Shows whether we're collecting sdiag metrics or not for this host. |
| slurm.sdiag.jobs_canceled (gauge) | Number of jobs canceled since last reset. Shown as job |
| slurm.sdiag.jobs_completed (gauge) | Number of jobs completed since last reset. Shown as job |
| slurm.sdiag.jobs_failed (gauge) | Number of jobs failed since last reset. Shown as job |
| slurm.sdiag.jobs_pending (gauge) | Number of jobs pending since last reset. Shown as job |
| slurm.sdiag.jobs_running (gauge) | Number of jobs running since last reset. Shown as job |
| slurm.sdiag.jobs_started (gauge) | Number of jobs started since last reset. Shown as job |
| slurm.sdiag.jobs_submitted (gauge) | Number of jobs submitted since last reset. Shown as job |
| slurm.sdiag.last_cycle (gauge) | Time in microseconds for last scheduling cycle. Shown as microsecond |
| slurm.sdiag.last_queue_length (gauge) | Length of jobs pending queue. Shown as job |
| slurm.sdiag.max_cycle (gauge) | Maximum time in microseconds for any scheduling cycle since last reset. Shown as microsecond |
| slurm.sdiag.mean_cycle (gauge) | Mean time in microseconds for all scheduling cycles since last reset. Shown as microsecond |
| slurm.sdiag.mean_depth_cycle (gauge) | Mean of cycle depth. Depth means number of jobs processed in a scheduling cycle. Shown as job |
| slurm.sdiag.server_thread_count (gauge) | The number of current active slurmctld threads. Shown as thread |
| slurm.sdiag.total_cycles (gauge) | The total run time in microseconds for all scheduling cycles since the last reset. Shown as microsecond |
| slurm.share.effective_usage (gauge) | The association's usage normalized with its parent. |
| slurm.share.fair_share (gauge) | The Fair-Share factor, based on a user or account's assigned shares and the effective usage charged to them or their accounts. |
| slurm.share.level_fs (gauge) | The association's fairshare value compared to its siblings, calculated as norm_shares / effective_usage. |
| slurm.share.norm_shares (gauge) | The shares assigned to the user or account normalized to the total number of assigned shares. |
| slurm.share.norm_usage (gauge) | The Raw Usage normalized to the total number of tres-seconds of all jobs run on the cluster. |
| slurm.share.raw_shares (gauge) | The raw shares assigned to the user or account. |
| slurm.share.raw_usage (gauge) | The number of tres-seconds (cpu-seconds if TRESBillingWeights is not defined) of all the jobs charged to the account or user. |
| slurm.sinfo.node.enabled (gauge) | Shows whether we're collecting node metrics or not for this host. |
| slurm.sinfo.partition.enabled (gauge) | Shows whether we're collecting partition metrics or not for this host. |
| slurm.sinfo.squeue.enabled (gauge) | Shows whether we're collecting squeue metrics or not for this host. |
| slurm.squeue.job.info (gauge) | Information about the Slurm job in squeue. |
| slurm.sshare.enabled (gauge) | Shows whether we're collecting sshare metrics or not for this host. |
The Slurm integration does not include any events.
Need help? Contact Datadog support.