Slurm

Supported OS Linux

Integration version 2.0.1

Overview

This check monitors Slurm through the Datadog Agent.

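Slurm (Simple Linux Utility for Resource Management) is an open-source workload manager used to schedule and manage jobs on large compute clusters. It allocates resources, monitors job queues, and ensures efficient execution of parallel and batch jobs in high-performance computing environments.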

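The check collects metrics from the management node (slurmctld) by running several command-line binaries (sinfo, squeue, sacct, sdiag, and sshare) and parsing their output. These commands provide detailed information on resource availability, job queues, accounting, diagnostics, and share usage in a Slurm-managed cluster.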
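The run-and-parse approach can be illustrated with a minimal Python sketch. It is not the integration's actual implementation: the format string `%P|%a|%D|%C`, the function names, and the sample values are assumptions chosen for illustration. The sketch parses a hypothetical sample of sinfo output so that it runs without a Slurm cluster; `run_sinfo` shows how the same text might be obtained from a real binary.

```python
import os
import subprocess

# Hypothetical sample of `sinfo -h -o "%P|%a|%D|%C"` output. The values are
# made up so the sketch can run on a machine without Slurm installed.
SAMPLE_SINFO_OUTPUT = """debug*|up|4|8/24/0/32
batch|up|10|40/100/20/160
"""


def parse_sinfo_cpus(output):
    """Turn pipe-delimited sinfo lines into per-partition node and CPU counts."""
    metrics = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        partition, availability, nodes, cpus = line.strip().split("|")
        # %C reports CPUs by state as allocated/idle/other/total.
        allocated, idle, other, total = (int(v) for v in cpus.split("/"))
        # sinfo marks the default partition with a trailing asterisk.
        metrics[partition.rstrip("*")] = {
            "availability": availability,
            "nodes": int(nodes),
            "cpu.allocated": allocated,
            "cpu.idle": idle,
            "cpu.other": other,
            "cpu.total": total,
        }
    return metrics


def run_sinfo(binaries_dir="/usr/bin/"):
    """Run sinfo from the given directory and return its raw text output."""
    command = [os.path.join(binaries_dir, "sinfo"), "-h", "-o", "%P|%a|%D|%C"]
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    for name, values in parse_sinfo_cpus(SAMPLE_SINFO_OUTPUT).items():
        print(name, values)
```

On a management node, `parse_sinfo_cpus(run_sinfo())` would apply the same parsing to live output, provided the user running it can execute sinfo.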

On worker nodes, the check can also collect metrics using scontrol, which provides process IDs (PIDs) and other job information that is not available from the management node.
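As a rough illustration of the kind of PID information scontrol exposes on a worker node, the following Python sketch groups PIDs by job from text in the column layout printed by `scontrol listpids`. Whether the check calls this exact subcommand is an assumption, and the sample rows and the function name `pids_by_job` are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample in the column layout of `scontrol listpids`.
# The PIDs and job IDs are made up for illustration.
SAMPLE_LISTPIDS_OUTPUT = """PID      JOBID    STEPID   LOCALID GLOBALID
3012     1187     batch    0       0
3020     1187     0        0       0
3021     1187     0        1       1
4100     1190     batch    0       0
"""


def pids_by_job(output):
    """Group (pid, step_id) pairs by job ID, skipping the header row."""
    jobs = defaultdict(list)
    for line in output.splitlines():
        fields = line.split()
        # Ignore blank or short lines and the column header.
        if len(fields) < 3 or fields[0] == "PID":
            continue
        pid, job_id, step_id = fields[0], fields[1], fields[2]
        jobs[job_id].append((int(pid), step_id))
    return dict(jobs)


if __name__ == "__main__":
    for job_id, processes in pids_by_job(SAMPLE_LISTPIDS_OUTPUT).items():
        print(job_id, processes)
```

A mapping like this is what makes it possible to relate a process seen on the host to the Slurm job and step that started it.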

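Setup

Follow the instructions below to install and configure this check for an Agent running on a host. Because the Agent needs direct access to the various Slurm binaries, monitoring Slurm in containerized environments is not recommended.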

Note: This check was tested on Slurm version 21.08.0.

Installation

The Slurm check is included in the Datadog Agent package. No additional installation is needed on your server.

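Configuration

Management node

Ensure that the dd-agent user has execute permission on the relevant command binaries and the permissions required to access the directories where those binaries are located.
To start collecting Slurm data, edit the slurm.d/conf.yaml file in the conf.d/ folder at the root of your Agent's configuration directory. See the sample slurm.d/conf.yaml for all available configuration options.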

init_config:

    ## Customize this section if your binaries are not located in the /usr/bin/ directory
    ## @param slurm_binaries_dir - string - optional - default: /usr/bin/
    ## Directory where all Slurm binaries are located. The main binaries are:
    ## sinfo, squeue, sacct, sdiag, and sshare.

    slurm_binaries_dir: /usr/bin/

instances:

  -
    ## Configure the parameters below to select the data collected by the integration.
    ## @param collect_sinfo_stats - boolean - optional - default: true
    ## Whether to collect statistics from the sinfo command.
    #
    collect_sinfo_stats: true

    ## @param collect_sdiag_stats - boolean - optional - default: true
    ## Whether to collect statistics from the sdiag command.
    #
    collect_sdiag_stats: true

    ## @param collect_squeue_stats - boolean - optional - default: true
    ## Whether to collect statistics from the squeue command.
    #
    collect_squeue_stats: true

    ## @param collect_sacct_stats - boolean - optional - default: true
    ## Whether to collect statistics from the sacct command.
    #
    collect_sacct_stats: true

    ## @param collect_sshare_stats - boolean - optional - default: true
    ## Whether to collect statistics from the sshare command.
    #
    collect_sshare_stats: true

    ## @param collect_gpu_stats - boolean - optional - default: false
    ## Whether to collect GPU statistics using sinfo when Slurm is configured to use GPUs.
    #
    collect_gpu_stats: true

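    ## @param sinfo_collection_level - integer - optional - default: 1
    ## Level of detail to collect from the sinfo command. The default is 1 (basic). Available options are 1, 2, and 3.
    ## Level 1 collects data for partitions only. Level 2 collects data from individual nodes. Level 3 also
    ## collects data from individual nodes but is more detailed: it includes data such as CPU and
    ## memory usage as reported by the OS, as well as additional tags.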
    #
    sinfo_collection_level: 3

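    ## @param collect_scontrol_stats - boolean - optional - default: false
    ## Whether to collect statistics from the scontrol command. This is mainly used on worker nodes
    ## to collect the list of running jobs along with their PIDs.
    collect_scontrol_stats: false # Enable this only on worker nodes, not on the management node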

Restart the Agent.
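The permission requirement at the start of this section can be checked with a short script before relying on the check. The Python sketch below is illustrative only: the function name `check_binaries` and the list of binary names are assumptions taken from the commands named in the overview. It reports whether each binary exists in the configured directory and is executable by the user running the script, so it is meant to be run as dd-agent (for example through sudo -u dd-agent).

```python
import os

# Binaries named in the overview; adjust the tuple to match your cluster.
SLURM_BINARIES = ("sinfo", "squeue", "sacct", "sdiag", "sshare")


def check_binaries(binaries_dir="/usr/bin/", names=SLURM_BINARIES):
    """Report, per binary, whether the current user can find and execute it."""
    results = {}
    for name in names:
        path = os.path.join(binaries_dir, name)
        if not os.path.isfile(path):
            # Also reached when the directory itself cannot be traversed.
            results[name] = "missing or unreachable"
        elif not os.access(path, os.X_OK):
            results[name] = "not executable"
        else:
            results[name] = "ok"
    return results


if __name__ == "__main__":
    for name, status in check_binaries().items():
        print(f"{name}: {status}")
```

Pass the value of slurm_binaries_dir as the first argument if the binaries live somewhere other than /usr/bin/.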

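Worker nodes

The slurm.scontrol.job.info metric can only be collected from worker nodes. It carries important tags that can be used to monitor the resource consumption of specific job steps.

Ensure that the dd-agent user has execute permission on the scontrol binary and the permissions required to access the directory where the binary is located.
To start collecting Slurm data, edit the slurm.d/conf.yaml file in the conf.d/ folder at the root of your Agent's configuration directory. See the sample slurm.d/conf.yaml for all available configuration options.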

init_config:

    ## Customize this section if your binaries are not located in the /usr/bin/ directory
    ## @param slurm_binaries_dir - string - optional - default: /usr/bin/
    ## Directory where all Slurm binaries are located. The main binaries are:
    ## sinfo, squeue, sacct, sdiag, and sshare.

    slurm_binaries_dir: /usr/bin/

instances:

  - 
    ## @param collect_scontrol_stats - boolean - optional - default: false
    ## Whether to collect statistics from the scontrol command. This is mainly used on worker nodes
    ## to collect the list of running jobs along with their PIDs.
    collect_scontrol_stats: true

    # Keep the remaining settings below disabled on worker nodes: the information they cover is
    # specific to the management node and is not available on worker nodes.
    collect_sinfo_stats: false
    collect_sdiag_stats: false
    collect_squeue_stats: false
    collect_sacct_stats: false
    collect_sshare_stats: false
    collect_gpu_stats: false
    sinfo_collection_level: 1

Restart the Agent.
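The worker-node rule above (scontrol collection on, management-node collectors off) can be expressed as a small consistency check. The Python sketch below assumes the instance has already been loaded from slurm.d/conf.yaml into a dictionary; the function name `worker_config_problems` is hypothetical, and the defaults are the ones stated in the configuration comments.

```python
# Documented defaults for the collectors that only make sense on the
# management node. An option missing from the instance takes its default.
MANAGEMENT_ONLY_DEFAULTS = {
    "collect_sinfo_stats": True,
    "collect_sdiag_stats": True,
    "collect_squeue_stats": True,
    "collect_sacct_stats": True,
    "collect_sshare_stats": True,
    "collect_gpu_stats": False,
}


def worker_config_problems(instance):
    """Return a list of reasons the instance is unsuitable for a worker node."""
    problems = []
    # collect_scontrol_stats defaults to false, so it must be set explicitly.
    if instance.get("collect_scontrol_stats", False) is not True:
        problems.append("collect_scontrol_stats should be true on a worker node")
    for option, default in MANAGEMENT_ONLY_DEFAULTS.items():
        if instance.get(option, default) is not False:
            problems.append(f"{option} should be false on a worker node")
    return problems


if __name__ == "__main__":
    worker_instance = {
        "collect_scontrol_stats": True,
        "collect_sinfo_stats": False,
        "collect_sdiag_stats": False,
        "collect_squeue_stats": False,
        "collect_sacct_stats": False,
        "collect_sshare_stats": False,
        "collect_gpu_stats": False,
        "sinfo_collection_level": 1,
    }
    print(worker_config_problems(worker_instance) or "worker configuration looks consistent")
```

Because most collectors default to true, leaving them out of a worker-node instance has the same effect as enabling them, which is why the sample configuration sets each one to false explicitly.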

Validation

Run the Agent's status subcommand and look for slurm under the Checks section.

Data Collected

Metrics


slurm.node.alloc_mem (gauge)	Number of megabytes allocated on the node. Shown as megabyte
slurm.node.cpu.allocated (gauge)	Number of CPUs allocated on the node for job-related tasks. Shown as cpu
slurm.node.cpu.idle (gauge)	Number of idle CPUs on the node. Shown as cpu
slurm.node.cpu.other (gauge)	Number of CPUs performing other or non-job-related tasks on the node. Shown as cpu
slurm.node.cpu.total (gauge)	Total number of CPUs on the node. Shown as cpu
slurm.node.cpu_load (gauge)	CPU load on the node as reported by the OS.
slurm.node.free_mem (gauge)	Free memory on the node as reported by the OS. Shown as megabyte
slurm.node.gpu_total (gauge)	Total number of GPUs on the node.
slurm.node.gpu_used (gauge)	Number of GPUs used on the node.
slurm.node.info (gauge)	Information about the Slurm node.
slurm.node.memory (gauge)	Total memory on the node as reported by the OS. Shown as megabyte
slurm.node.tmp_disk (gauge)	Temporary disk space on the node as reported by the OS. Shown as megabyte
slurm.partition.cpu.allocated (gauge)	(Deprecated) Number of CPUs allocated on the partition for job-related tasks. Shown as cpu
slurm.partition.cpu.idle (gauge)	(Deprecated) Number of idle CPUs on the partition. Shown as cpu
slurm.partition.cpu.other (gauge)	(Deprecated) Number of CPUs performing other or non-job-related tasks on the partition. Shown as cpu
slurm.partition.cpu.total (gauge)	(Deprecated) Total number of CPUs on the partition. Shown as cpu
slurm.partition.gpu_total (gauge)	Total number of GPUs on the partition.
slurm.partition.gpu_used (gauge)	Number of GPUs used on the partition.
slurm.partition.info (gauge)	Information about the Slurm partition.
slurm.partition.node.allocated (gauge)	Number of nodes allocated on the partition for job-related tasks. Shown as node
slurm.partition.node.idle (gauge)	Number of idle nodes on the partition. Shown as node
slurm.partition.node.other (gauge)	Number of nodes performing other or non-job-related tasks on the partition. Shown as node
slurm.partition.node.total (gauge)	Total number of nodes on the partition. Shown as node
slurm.partition.nodes.count (gauge)	Number of nodes in the partition. Shown as node
slurm.sacct.enabled (gauge)	Shows whether we’re collecting sacct metrics or not for this host.
slurm.sacct.job.duration (gauge)	Duration of the job in seconds. Shown as second
slurm.sacct.job.info (gauge)	Information about the Slurm job in sacct.
slurm.sacct.slurm_job_ave_disk_read (gauge)	Average number of bytes read from disk by a job or step. Shown as byte
slurm.sacct.slurm_job_avgcpu (gauge)	Average (system + user) CPU time of all tasks in job. Shown as second
slurm.sacct.slurm_job_avgrss (gauge)	Average resident set size of all tasks in job. Shown as byte
slurm.sacct.slurm_job_cputime (gauge)	Time used (Elapsed time * CPU count) by a job or step in cpu-seconds. Shown as second
slurm.sacct.slurm_job_max_disk_read (gauge)	Maximum number of bytes read from disk by a job or step. Shown as byte
slurm.sacct.slurm_job_maxrss (gauge)	Maximum resident set size of all tasks in job. Shown as byte
slurm.sacct.slurm_job_maxvm (gauge)	Maximum virtual memory size of all tasks in job. Shown as byte
slurm.scontrol.job.info (gauge)	Status of running jobs on worker node. Shown as job
slurm.sdiag.agent_count (gauge)	Number of agent threads. Shown as thread
slurm.sdiag.agent_queue_size (gauge)	Number of enqueued outgoing RPC requests in an internal retry list. Shown as request
slurm.sdiag.agent_thread_count (gauge)	Total count of active threads created by all the agent threads. Shown as thread
slurm.sdiag.backfill.depth_mean (gauge)	Mean count of jobs processed during all backfilling scheduling cycles since last reset. Shown as job
slurm.sdiag.backfill.depth_mean_try_depth (gauge)	The subset of Depth Mean that the backfill scheduler attempted to schedule. Shown as job
slurm.sdiag.backfill.last_cycle (gauge)	Time in microseconds of last backfill scheduling cycle. Shown as microsecond
slurm.sdiag.backfill.last_cycle_seconds_ago (gauge)	Time in seconds since the last scheduling cycle. Shown as second
slurm.sdiag.backfill.last_depth_cycle (gauge)	Number of processed jobs during last backfilling scheduling cycle. Shown as job
slurm.sdiag.backfill.last_depth_try_schedule (gauge)	Number of processed jobs during last backfilling scheduling cycle, counting only jobs with a chance to start using available resources. Shown as job
slurm.sdiag.backfill.last_queue_length (gauge)	Number of jobs pending to be processed by backfilling algorithm. Shown as job
slurm.sdiag.backfill.last_table_size (gauge)	Number of different time slots tested by the backfill scheduler in its last iteration.
slurm.sdiag.backfill.max_cycle (gauge)	Time in microseconds of maximum backfill scheduling cycle execution since last reset. Shown as microsecond
slurm.sdiag.backfill.mean_cycle (gauge)	Mean time in microseconds of backfilling scheduling cycles since last reset. Shown as microsecond
slurm.sdiag.backfill.mean_table_size (gauge)	Mean count of different time slots tested by the backfill scheduler.
slurm.sdiag.backfill.queue_length_mean (gauge)	Mean count of jobs pending to be processed by backfilling algorithm. Shown as job
slurm.sdiag.backfill.total_cycles (gauge)	Number of backfill scheduling cycles since last reset.
slurm.sdiag.backfill.total_heterogeneous_components (gauge)	Number of heterogeneous job components started thanks to backfilling since last Slurm start.
slurm.sdiag.backfill.total_jobs_since_cycle_start (gauge)	Total backfilled jobs since last stats cycle restart. Shown as job
slurm.sdiag.backfill.total_jobs_since_start (gauge)	Total backfilled jobs since last slurm restart. Shown as job
slurm.sdiag.cycles_per_minute (gauge)	Scheduling executions per minute.
slurm.sdiag.dbd_agent_queue_size (gauge)	DBD Agent message queue size for SlurmDBD. Shown as message
slurm.sdiag.enabled (gauge)	Shows whether we’re collecting sdiag metrics or not for this host.
slurm.sdiag.jobs_canceled (gauge)	Number of jobs canceled since last reset. Shown as job
slurm.sdiag.jobs_completed (gauge)	Number of jobs completed since last reset. Shown as job
slurm.sdiag.jobs_failed (gauge)	Number of jobs failed since last reset. Shown as job
slurm.sdiag.jobs_pending (gauge)	Number of jobs pending since last reset. Shown as job
slurm.sdiag.jobs_running (gauge)	Number of jobs running since last reset. Shown as job
slurm.sdiag.jobs_started (gauge)	Number of jobs started since last reset. Shown as job
slurm.sdiag.jobs_submitted (gauge)	Number of jobs submitted since last reset. Shown as job
slurm.sdiag.last_cycle (gauge)	Time in microseconds for last scheduling cycle. Shown as microsecond
slurm.sdiag.last_queue_length (gauge)	Length of jobs pending queue. Shown as job
slurm.sdiag.max_cycle (gauge)	Maximum time in microseconds for any scheduling cycle since last reset. Shown as microsecond
slurm.sdiag.mean_cycle (gauge)	Mean time in microseconds for all scheduling cycles since last reset. Shown as microsecond
slurm.sdiag.mean_depth_cycle (gauge)	Mean of cycle depth. Depth means number of jobs processed in a scheduling cycle. Shown as job
slurm.sdiag.server_thread_count (gauge)	The number of current active slurmctld threads. Shown as thread
slurm.sdiag.total_cycles (gauge)	The total run time in microseconds for all scheduling cycles since the last reset. Shown as microsecond
slurm.seff.cpu_efficiency (gauge)	The CPU efficiency of the job. Shown as percent
slurm.seff.cpu_utilized (gauge)	The CPU utilized by the job. Shown as second
slurm.seff.memory_efficiency (gauge)	The memory efficiency of the job. Shown as percent
slurm.seff.memory_utilized_mb (gauge)	The memory utilized by the job. Shown as megabyte
slurm.share.effective_usage (gauge)	The association’s usage normalized with its parent.
slurm.share.fair_share (gauge)	The Fair-Share factor, based on a user or account’s assigned shares and the effective usage charged to them or their accounts.
slurm.share.level_fs (gauge)	The association’s fairshare value compared to its siblings, calculated as norm_shares / effective_usage.
slurm.share.norm_shares (gauge)	The shares assigned to the user or account normalized to the total number of assigned shares.
slurm.share.norm_usage (gauge)	The Raw Usage normalized to the total number of tres-seconds of all jobs run on the cluster.
slurm.share.raw_shares (gauge)	The raw shares assigned to the user or account.
slurm.share.raw_usage (gauge)	The number of tres-seconds (cpu-seconds if TRESBillingWeights is not defined) of all the jobs charged to the account or user.
slurm.sinfo.node.enabled (gauge)	Shows whether we’re collecting node metrics or not for this host.
slurm.sinfo.partition.enabled (gauge)	Shows whether we’re collecting partition metrics or not for this host.
slurm.sinfo.squeue.enabled (gauge)	Shows whether we’re collecting squeue metrics or not for this host.
slurm.squeue.job.info (gauge)	Information about the Slurm job in squeue.
slurm.sshare.enabled (gauge)	Shows whether we’re collecting sshare metrics or not for this host.

Events

The Slurm integration does not include any events.

Troubleshooting

Need help? Contact Datadog support.