Supported OS: Linux, Windows, Mac OS

Integration version 1.3.0

Overview

This check monitors IBM Spectrum LSF using the Datadog Agent.

This integration gives an overview of the performance of your IBM Spectrum LSF environment. It also provides detailed information about running and completed jobs, slot utilization, and queues.

Setup

Follow the instructions below to install and configure this check for an Agent running on a host. For containerized environments, see the Autodiscovery Integration Templates for guidance on applying these instructions.

Installation

The IBM Spectrum LSF check is included in the Datadog Agent package.

Install the Datadog Agent and configure the IBM Spectrum LSF check on the management host of your cluster. This integration monitors the entire cluster.

Additional Configuration on Linux

Add the dd-agent user as an LSF administrator.

The integration runs commands such as lsid, bhosts, and lsclusters, so the Agent needs them in its PATH. In an interactive shell, this is typically done by running source $LSF_HOME/conf/profile.lsf. However, the datadog-agent service is managed by upstart or systemd, which do not source this file, so you may need to add the environment variables to the service configuration files:

  1. To get the environment variables necessary for the Agent service, locate the <LSF_TOP_DIR>/conf/profile.lsf file and run the following command:

    env -i bash -c "source <LSF_TOP_DIR>/conf/profile.lsf; env"
    

    Running this command outputs a list of environment variables necessary to run the IBM Spectrum LSF commands.

  2. Add these environment variables to the configuration file for either systemd or upstart:

    • systemd: /etc/datadog-agent/environment. Here is an example configuration:

      LSF_SERVERDIR=<LSF_TOP_DIR>/10.1/linux3.10-glibc2.17-x86_64/etc
      LSF_ENVDIR=<LSF_TOP_DIR>/conf
      LSF_BINDIR=<LSF_TOP_DIR>/10.1/linux3.10-glibc2.17-x86_64/bin
      LSF_LIBDIR=<LSF_TOP_DIR>/10.1/linux3.10-glibc2.17-x86_64/lib
      PATH=<LSF_TOP_DIR>/10.1/linux3.10-glibc2.17-x86_64/etc:<LSF_TOP_DIR>/10.1/linux3.10-glibc2.17-x86_64/bin:/usr/local/bin:/usr/bin:/bin:.
      LD_LIBRARY_PATH=<LSF_TOP_DIR>/10.1/linux3.10-glibc2.17-x86_64/lib
      
    • upstart: /etc/init/datadog-agent.conf. (Note that each time there is an Agent update, /etc/init/datadog-agent.conf is wiped and needs to be updated again.) Here is an example configuration:

      description "Datadog Agent"
      
      start on started networking
      stop on runlevel [!2345]
      
      respawn
      respawn limit 10 5
      normal exit 0
      
      console log
      env DD_LOG_TO_CONSOLE=false
      env LSF_SERVERDIR=<LSF_TOP_DIR>/10.1/linux3.10-glibc2.17-x86_64/etc
      env LSF_ENVDIR=<LSF_TOP_DIR>/conf
      env LSF_BINDIR=<LSF_TOP_DIR>/10.1/linux3.10-glibc2.17-x86_64/bin
      env LSF_LIBDIR=<LSF_TOP_DIR>/10.1/linux3.10-glibc2.17-x86_64/lib
      env PATH=<LSF_TOP_DIR>/10.1/linux3.10-glibc2.17-x86_64/etc:<LSF_TOP_DIR>/10.1/linux3.10-glibc2.17-x86_64/bin:/usr/local/bin:/usr/bin:/bin:.
      env LD_LIBRARY_PATH=<LSF_TOP_DIR>/10.1/linux3.10-glibc2.17-x86_64/lib
      
      setuid dd-agent
      
      script
        exec /opt/datadog-agent/bin/agent/agent start -p /opt/datadog-agent/run/agent.pid
      end script
      
      post-stop script
        rm -f /opt/datadog-agent/run/agent.pid
      end script
      
  3. Restart the Agent.

For more information, see the Datadog Agent documentation on setting environment variables.
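
As a rough sketch of step 2, the following Python helper turns the output of the env -i bash -c "..." command into lines for either file. The function name format_lsf_env, the sample /opt/lsf paths, and the choice of which variables to keep (LSF_*, PATH, and LD_LIBRARY_PATH, mirroring the example configurations above) are assumptions made for illustration; adjust them to whatever your profile.lsf actually exports.

```python
def format_lsf_env(env_output, upstart=False):
    """Convert `env` output into systemd or upstart environment lines."""
    # Assumption: only these variables matter, as in the examples above.
    keep = ("PATH", "LD_LIBRARY_PATH")
    lines = []
    for raw in env_output.splitlines():
        key, sep, value = raw.partition("=")
        if not sep:
            # Skip anything that is not a KEY=VALUE pair.
            continue
        if key.startswith("LSF_") or key in keep:
            # upstart expects an "env " prefix; the systemd file does not.
            prefix = "env " if upstart else ""
            lines.append(f"{prefix}{key}={value}")
    return lines


# Hypothetical sample output; real output comes from the command in step 1.
sample = "LSF_ENVDIR=/opt/lsf/conf\nHOME=/root\nPATH=/opt/lsf/bin:/usr/bin\n"
print("\n".join(format_lsf_env(sample)))
```

Passing upstart=True produces the env-prefixed form used in /etc/init/datadog-agent.conf. Review the result before pasting it into a service file, since values that span multiple lines are not handled by this sketch.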

Configuration

  1. Edit the ibm_spectrum_lsf.d/conf.yaml file in the conf.d/ folder at the root of your Agent’s configuration directory to start collecting your IBM Spectrum LSF performance data. See the sample ibm_spectrum_lsf.d/conf.yaml for all available configuration options.

    The IBM Spectrum LSF integration runs a series of management commands to collect data. To control which commands are run and which metrics are emitted, use the metric_sources configuration option. By default, data from the following commands are collected, but you can enable more optional metrics or opt out of collecting any set of metrics: lsclusters, lshosts, bhosts, lsload, bqueues, bslots, bjobs.

    For example, to measure only GPU-specific metrics, your metric_sources looks like:

      metric_sources:
        - lsload_gpu
        - bhosts_gpu
    

    The badmin_perfmon metric source collects data from the badmin perfmon view -json command. This collects overall statistics about the cluster. To collect these metrics, performance collection must be enabled on your server using the badmin perfmon start <COLLECTION_INTERVAL> command. By default, the integration runs this command automatically (and stops collection once the Agent is turned off). However, you can turn off this behavior by setting badmin_perfmon_auto: false.

    Since collecting these metrics can add extra load on your server, we recommend setting a higher collection interval for these metrics: at least 60 seconds. The exact interval depends on the load and size of your cluster. View IBM Spectrum LSF’s recommendations for managing high query load.

    Similarly, the bhist command collects information about completed jobs, which can be query-intensive, so we recommend monitoring this command with the min_collection_interval set to 60 or higher. The bhist_details command involves running bhist -l for each completed job, so we recommend monitoring it with a higher min_collection_interval along with bhist.

    Here is a sample configuration monitoring all available metrics:

    instances:
    - cluster_name: test-cluster
      metric_sources:
        - lsclusters
        - lshosts
        - bhosts
        - lsload
        - bqueues
        - bslots
        - bjobs
        - lsload_gpu
        - bhosts_gpu
    - cluster_name: test-cluster
      badmin_perfmon_auto: false
      metric_sources:
        - badmin_perfmon
        - bhist
        - bhist_details
      min_collection_interval: 60
    
  2. Restart the Agent.
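
The guidance in step 1 can be sketched as a small pre-flight review of one instance from conf.yaml, represented here as a plain Python dict. This is not the integration's own validation: the function name review_instance is hypothetical, the list of source names is taken only from this page, and the threshold of 60 follows the recommendation above.

```python
# Source names as listed on this page; the check itself may accept others.
KNOWN_SOURCES = {
    "lsclusters", "lshosts", "bhosts", "lsload", "bqueues", "bslots",
    "bjobs", "lsload_gpu", "bhosts_gpu", "badmin_perfmon", "bhist",
    "bhist_details",
}
# Sources this page describes as query-intensive.
HEAVY_SOURCES = {"badmin_perfmon", "bhist", "bhist_details"}


def review_instance(instance):
    """Return a list of warnings for one instance mapping."""
    warnings = []
    sources = instance.get("metric_sources", [])
    for name in sources:
        if name not in KNOWN_SOURCES:
            warnings.append(f"unknown metric source: {name}")
    interval = instance.get("min_collection_interval")
    heavy = sorted(HEAVY_SOURCES.intersection(sources))
    if heavy and (interval is None or interval < 60):
        warnings.append(
            f"consider min_collection_interval >= 60 for: {', '.join(heavy)}"
        )
    return warnings


print(review_instance({"metric_sources": ["bhist"], "min_collection_interval": 15}))
```

Splitting the heavy sources into their own instance, as in the sample configuration, lets them use a longer interval without slowing down the other metrics.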

Logs

The IBM Spectrum LSF integration collects two types of logs: system logs and job logs.

Collecting system logs

System logs provide diagnostic information from the IBM Spectrum LSF daemons. You can collect them from the management host and execution hosts. To collect system logs:

  1. Enable log collection in your datadog.yaml file:

    logs_enabled: true
    
  2. Uncomment and edit the logs configuration block in your ibm_spectrum_lsf.d/conf.yaml file. For example:

     logs:
      - type: file
        source: ibm_spectrum_lsf
        tags:
         - log_type:system
        path: <LSF_TOP_DIR>/log/*
        service: <SERVICE_NAME>
    
Collecting job logs

Job logs are located on the job submission host, which is typically different from the management host. Ensure that the Datadog Agent is installed and running on the host where jobs are submitted.

Job logs are generated by job tasks and are useful for debugging failed jobs. To collect job logs:

  1. Ensure that the IBM Spectrum LSF job log files you want to monitor are named <JOB_ID>.out and <JOB_ID>.err. Configure this when submitting jobs by using the following bsub options:

    bsub -o %J.out -e %J.err

  2. Enable log collection in your datadog.yaml file:

    logs_enabled: true
    
  3. Uncomment and edit the logs configuration block in your ibm_spectrum_lsf.d/conf.yaml file. For example:

     logs:
      - type: file
        source: ibm_spectrum_lsf
        tags:
        - log_type:job
        path:
        - <PATH_TO_JOB_LOGS>/*.out
        - <PATH_TO_JOB_LOGS>/*.err
        service: <SERVICE_NAME>
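
To check that job log files follow the <JOB_ID>.out and <JOB_ID>.err naming from step 1, a sketch like the one below can group file names by job ID. The function name group_job_logs is hypothetical, and the pattern assumes that job IDs are purely numeric, which is what %J normally expands to.

```python
import re

# Assumption: %J expands to a numeric job ID, giving names like 101.out.
JOB_LOG = re.compile(r"^(\d+)\.(out|err)$")


def group_job_logs(filenames):
    """Map each job ID to the set of log kinds ('out', 'err') found for it."""
    jobs = {}
    for name in filenames:
        match = JOB_LOG.match(name)
        if match:
            job_id, kind = match.groups()
            jobs.setdefault(job_id, set()).add(kind)
    return jobs


# Hypothetical directory listing; in practice, pass os.listdir(<PATH_TO_JOB_LOGS>).
print(group_job_logs(["101.out", "101.err", "102.out", "notes.txt"]))
```

A job that appears with only one of the two kinds may have been submitted without both the -o and -e options.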
    

Validation

Run the Agent’s status subcommand and look for ibm_spectrum_lsf under the Checks section.

Data Collected

Metrics

ibm_spectrum_lsf.can_connect
(gauge)
Whether or not the integration can run LSF commands. [Always reported]
ibm_spectrum_lsf.cluster.hosts
(gauge)
The number of hosts in the cluster. [Reported by lsclusters]
ibm_spectrum_lsf.cluster.servers
(gauge)
The number of servers in the cluster. [Reported by lsclusters]
ibm_spectrum_lsf.cluster.status
(gauge)
The status of the cluster. [Reported by lsclusters]
ibm_spectrum_lsf.gpu.ecc
(gauge)
Number of ECC errors. [Reported by lsload_gpu]
ibm_spectrum_lsf.gpu.error
(gauge)
Whether or not the GPU is in an error state. [Reported by lsload_gpu]
ibm_spectrum_lsf.gpu.mem.total
(gauge)
The total memory available on the GPU. [Reported by lsload_gpu]
ibm_spectrum_lsf.gpu.mem.used
(gauge)
The total memory used on the GPU. [Reported by lsload_gpu]
ibm_spectrum_lsf.gpu.mem.utilization
(gauge)
The percentage of the GPU’s memory currently in use. [Reported by lsload_gpu]
ibm_spectrum_lsf.gpu.mode
(gauge)
The GPU’s compute mode, 0 is default. [Reported by lsload_gpu]
ibm_spectrum_lsf.gpu.power
(gauge)
Current power draw of the GPU in watts. [Reported by lsload_gpu]
Shown as watt
ibm_spectrum_lsf.gpu.pstate
(gauge)
Current performance state of the GPU. [Reported by lsload_gpu]
ibm_spectrum_lsf.gpu.status
(gauge)
Whether or not the GPU is OK. [Reported by lsload_gpu]
ibm_spectrum_lsf.gpu.temperature
(gauge)
The current temperature of the GPU. [Reported by lsload_gpu]
Shown as degree celsius
ibm_spectrum_lsf.gpu.utilization
(gauge)
The current GPU utilization. [Reported by lsload_gpu]
ibm_spectrum_lsf.host.cpu_factor
(gauge)
The relative CPU performance factor. [Reported by lshosts]
ibm_spectrum_lsf.host.is_server
(gauge)
Indicates whether the host is a server or client host. [Reported by lshosts]
ibm_spectrum_lsf.host.max_mem
(gauge)
The maximum amount of physical memory available for user processes. [Reported by lshosts]
ibm_spectrum_lsf.host.max_swap
(gauge)
The total available swap space. [Reported by lshosts]
ibm_spectrum_lsf.host.max_temp
(gauge)
The maximum /tmp space in MB configured on a host. [Reported by lshosts]
Shown as megabyte
ibm_spectrum_lsf.host.num_cores
(gauge)
The number of cores per processor that is configured on a host. [Reported by lshosts]
Shown as core
ibm_spectrum_lsf.host.num_cpus
(gauge)
The number of processors on this host. [Reported by lshosts]
ibm_spectrum_lsf.host.num_procs
(gauge)
The number of physical processors per CPU configured on a host. [Reported by lshosts]
ibm_spectrum_lsf.host.num_threads
(gauge)
The number of threads per core that is configured on a host. [Reported by lshosts]
Shown as thread
ibm_spectrum_lsf.job.completed.details.avg_memory
(gauge)
The average memory used by the completed job. [Reported by bhist_details]
Shown as megabyte
ibm_spectrum_lsf.job.completed.details.cpu_average_efficiency
(gauge)
The CPU average efficiency percentage of the completed job. [Reported by bhist_details]
Shown as percent
ibm_spectrum_lsf.job.completed.details.cpu_peak
(gauge)
The CPU peak value for the completed job. [Reported by bhist_details]
ibm_spectrum_lsf.job.completed.details.cpu_peak_duration
(gauge)
The duration of CPU peak usage for the completed job. [Reported by bhist_details]
Shown as second
ibm_spectrum_lsf.job.completed.details.cpu_peak_efficiency
(gauge)
The CPU peak efficiency percentage of the completed job. [Reported by bhist_details]
Shown as percent
ibm_spectrum_lsf.job.completed.details.cpu_time
(gauge)
The total CPU time consumed by the completed job. [Reported by bhist_details]
Shown as second
ibm_spectrum_lsf.job.completed.details.exit_code
(gauge)
The exit code returned by the completed job. [Reported by bhist_details]
ibm_spectrum_lsf.job.completed.details.max_memory
(gauge)
The maximum memory used by the completed job. [Reported by bhist_details]
Shown as megabyte
ibm_spectrum_lsf.job.completed.details.mem_efficiency
(gauge)
The memory efficiency percentage of the completed job. [Reported by bhist_details]
Shown as percent
ibm_spectrum_lsf.job.completed.details.status
(gauge)
The status of the completed job (1). Tagged with status:success or status:failure. [Reported by bhist_details]
ibm_spectrum_lsf.job.completed.details.success
(gauge)
Indicates whether the job completed successfully (1) or failed (0). [Reported by bhist_details]
ibm_spectrum_lsf.job.completed.pending
(gauge)
The total amount of time spent by the job in the pending state. [Reported by bhist]
Shown as second
ibm_spectrum_lsf.job.completed.pending_user_suspended
(gauge)
The total amount of time spent by the job in the pending user suspended state, that is, suspended while still pending. [Reported by bhist]
Shown as second
ibm_spectrum_lsf.job.completed.running
(gauge)
The total run time of the job. [Reported by bhist]
Shown as second
ibm_spectrum_lsf.job.completed.system_suspended
(gauge)
The total amount of time the job was in the system suspended state. [Reported by bhist]
Shown as second
ibm_spectrum_lsf.job.completed.total
(gauge)
The total amount of time spent by the job from submission to completion. [Reported by bhist]
Shown as second
ibm_spectrum_lsf.job.completed.unknown
(gauge)
The total amount of time spent by the job in an unknown state. [Reported by bhist]
Shown as second
ibm_spectrum_lsf.job.completed.user_suspended
(gauge)
The total amount of time spent by the job in the user suspended state. [Reported by bhist]
Shown as second
ibm_spectrum_lsf.job.cpu_used
(gauge)
The CPU used by the job. [Reported by bjobs]
ibm_spectrum_lsf.job.idle_factor
(gauge)
Job idle information (CPU time/runtime) if JOB_IDLE is configured in the queue, and the job has triggered an idle exception. [Reported by bjobs]
ibm_spectrum_lsf.job.mem
(gauge)
Total resident memory usage of all processes in a job. [Reported by bjobs]
ibm_spectrum_lsf.job.percent_complete
(gauge)
The estimated completion percentage of the job. [Reported by bjobs]
ibm_spectrum_lsf.job.run_time
(gauge)
Estimated run time for the job. [Reported by bjobs]
Shown as second
ibm_spectrum_lsf.job.swap
(gauge)
Total virtual memory and swap usage of all processes in a job. [Reported by bjobs]
ibm_spectrum_lsf.job.time_left
(gauge)
The estimated run time that the job has remaining. [Reported by bjobs]
Shown as second
ibm_spectrum_lsf.load.cpu.run_queue_length.15m
(gauge)
The 15 minute exponentially averaged CPU run queue length. [Reported by lsload]
ibm_spectrum_lsf.load.cpu.run_queue_length.15s
(gauge)
The 15 second exponentially averaged CPU run queue length. [Reported by lsload]
ibm_spectrum_lsf.load.cpu.run_queue_length.1m
(gauge)
The 1 minute exponentially averaged CPU run queue length. [Reported by lsload]
ibm_spectrum_lsf.load.cpu.utilization
(gauge)
The CPU utilization exponentially averaged over the last minute, 0 - 1. [Reported by lsload]
ibm_spectrum_lsf.load.disk.io
(gauge)
The disk I/O rate exponentially averaged over the last minute, in KB per second. [Reported by lsload]
Shown as kilobyte
ibm_spectrum_lsf.load.idle_time
(gauge)
On UNIX, the idle time of the host (keyboard is not touched on all logged in sessions), in minutes. On Windows, the idle time (it) index is based on the time that a screen saver is active on a particular host. [Reported by lsload]
Shown as minute
ibm_spectrum_lsf.load.login_users
(gauge)
The number of current login users. [Reported by lsload]
ibm_spectrum_lsf.load.mem.available_ram
(gauge)
The amount of available RAM. [Reported by lsload]
Shown as megabyte
ibm_spectrum_lsf.load.mem.available_swap
(gauge)
The amount of available swap space. [Reported by lsload]
Shown as megabyte
ibm_spectrum_lsf.load.mem.free
(gauge)
The amount of free space in /tmp, in MB. [Reported by lsload]
Shown as megabyte
ibm_spectrum_lsf.load.mem.paging_rate
(gauge)
The memory paging rate exponentially averaged over the last minute, in pages per second. [Reported by lsload]
Shown as page
ibm_spectrum_lsf.load.status
(gauge)
Status of the host. [Reported by lsload]
ibm_spectrum_lsf.perfmon.host.queries.avg
(gauge)
The average number of host information queries in the sampling period. [Reported by badmin_perfmon]
ibm_spectrum_lsf.perfmon.host.queries.current
(gauge)
The current number of host information queries in the sampling period. [Reported by badmin_perfmon]
ibm_spectrum_lsf.perfmon.host.queries.max
(gauge)
The max number of host information queries in the sampling period. [Reported by badmin_perfmon]
ibm_spectrum_lsf.perfmon.host.queries.min
(gauge)
The min number of host information queries in the sampling period. [Reported by badmin_perfmon]
ibm_spectrum_lsf.perfmon.host.queries.total
(gauge)
The total number of host information queries in the sampling period. [Reported by badmin_perfmon]
ibm_spectrum_lsf.perfmon.jobs.accepted_remote.avg
(gauge)
The average number of jobs accepted from remote cluster in the sampling period. [Reported by badmin_perfmon]
Shown as job
ibm_spectrum_lsf.perfmon.jobs.accepted_remote.current
(gauge)
The current number of jobs accepted from remote cluster in the sampling period. [Reported by badmin_perfmon]
Shown as job
ibm_spectrum_lsf.perfmon.jobs.accepted_remote.max
(gauge)
The max number of jobs accepted from remote cluster in the sampling period. [Reported by badmin_perfmon]
Shown as job
ibm_spectrum_lsf.perfmon.jobs.accepted_remote.min
(gauge)
The min number of jobs accepted from remote cluster in the sampling period. [Reported by badmin_perfmon]
Shown as job
ibm_spectrum_lsf.perfmon.jobs.accepted_remote.total
(gauge)
The total number of jobs accepted from remote cluster in the sampling period. [Reported by badmin_perfmon]
Shown as job
ibm_spectrum_lsf.perfmon.jobs.buckets.avg
(gauge)
The average number of scheduler buckets in which jobs are put based on resource requirements and different scheduling policies. [Reported by badmin_perfmon]
ibm_spectrum_lsf.perfmon.jobs.buckets.current
(gauge)
The current number of scheduler buckets in which jobs are put based on resource requirements and different scheduling policies. [Reported by badmin_perfmon]
ibm_spectrum_lsf.perfmon.jobs.buckets.max
(gauge)
The max number of scheduler buckets in which jobs are put based on resource requirements and different scheduling policies. [Reported by badmin_perfmon]
ibm_spectrum_lsf.perfmon.jobs.buckets.min
(gauge)
The min number of scheduler buckets in which jobs are put based on resource requirements and different scheduling policies. [Reported by badmin_perfmon]
ibm_spectrum_lsf.perfmon.jobs.buckets.total
(gauge)
The total number of scheduler buckets in which jobs are put based on resource requirements and different scheduling policies. [Reported by badmin_perfmon]
ibm_spectrum_lsf.perfmon.jobs.completed.avg
(gauge)
The average number of jobs completed in the sampling period. [Reported by badmin_perfmon]
Shown as job
ibm_spectrum_lsf.perfmon.jobs.completed.current
(gauge)
The number of jobs completed in the sampling period. [Reported by badmin_perfmon]
Shown as job
ibm_spectrum_lsf.perfmon.jobs.completed.max
(gauge)
The max number of jobs completed in the sampling period. [Reported by badmin_perfmon]
Shown as job
ibm_spectrum_lsf.perfmon.jobs.completed.min
(gauge)
The min number of jobs completed in the sampling period. [Reported by badmin_perfmon]
Shown as job
ibm_spectrum_lsf.perfmon.jobs.completed.total
(gauge)
The total number of jobs completed in the sampling period. [Reported by badmin_perfmon]
Shown as job
ibm_spectrum_lsf.perfmon.jobs.dispatched.avg
(gauge)
The average number of jobs dispatched in the sampling period. [Reported by badmin_perfmon]
Shown as job
ibm_spectrum_lsf.perfmon.jobs.dispatched.current
(gauge)
The number of jobs dispatched in the sampling period. [Reported by badmin_perfmon]
Shown as job
ibm_spectrum_lsf.perfmon.jobs.dispatched.max
(gauge)
The max number of jobs dispatched in the sampling period. [Reported by badmin_perfmon]
Shown as job
ibm_spectrum_lsf.perfmon.jobs.dispatched.min
(gauge)
The min number of jobs dispatched in the sampling period. [Reported by badmin_perfmon]
Shown as job
ibm_spectrum_lsf.perfmon.jobs.dispatched.total
(gauge)
The total number of jobs dispatched in the sampling period. [Reported by badmin_perfmon]
Shown as job
ibm_spectrum_lsf.perfmon.jobs.queries.avg
(gauge)
The average number of job queries in the sampling period. [Reported by badmin_perfmon]
ibm_spectrum_lsf.perfmon.jobs.queries.current
(gauge)
The number of job queries in the sampling period. [Reported by badmin_perfmon]
ibm_spectrum_lsf.perfmon.jobs.queries.max
(gauge)
The max number of job queries in the sampling period. [Reported by badmin_perfmon]
ibm_spectrum_lsf.perfmon.jobs.queries.min
(gauge)
The min number of job queries in the sampling period. [Reported by badmin_perfmon]
ibm_spectrum_lsf.perfmon.jobs.queries.total
(gauge)
The total number of job queries in the sampling period. [Reported by badmin_perfmon]
ibm_spectrum_lsf.perfmon.jobs.reordered.avg
(gauge)
The average number of jobs reordered in the sampling period, that is, the number of jobs that reused the resource allocation of a finished job. [Reported by badmin_perfmon]
Shown as job
ibm_spectrum_lsf.perfmon.jobs.reordered.current
(gauge)
The number of jobs reordered in the sampling period, that is, the number of jobs that reused the resource allocation of a finished job. [Reported by badmin_perfmon]
Shown as job
ibm_spectrum_lsf.perfmon.jobs.reordered.max
(gauge)
The max number of jobs reordered in the sampling period, that is, the number of jobs that reused the resource allocation of a finished job. [Reported by badmin_perfmon]
Shown as job
ibm_spectrum_lsf.perfmon.jobs.reordered.min
(gauge)
The min number of jobs reordered in the sampling period, that is, the number of jobs that reused the resource allocation of a finished job. [Reported by badmin_perfmon]
Shown as job
ibm_spectrum_lsf.perfmon.jobs.reordered.total
(gauge)
The total number of jobs reordered in the sampling period, that is, the number of jobs that reused the resource allocation of a finished job. [Reported by badmin_perfmon]
Shown as job
ibm_spectrum_lsf.perfmon.jobs.scheduling_interval.avg
(gauge)
The average scheduling interval in the sampling period. [Reported by badmin_perfmon]
Shown as second
ibm_spectrum_lsf.perfmon.jobs.scheduling_interval.current
(gauge)
The current scheduling interval in the sampling period. [Reported by badmin_perfmon]
Shown as second
ibm_spectrum_lsf.perfmon.jobs.scheduling_interval.max
(gauge)
The max scheduling interval in the sampling period. [Reported by badmin_perfmon]
Shown as second
ibm_spectrum_lsf.perfmon.jobs.scheduling_interval.min
(gauge)
The min scheduling interval in the sampling period. [Reported by badmin_perfmon]
Shown as second
ibm_spectrum_lsf.perfmon.jobs.scheduling_interval.total
(gauge)
The total scheduling interval in the sampling period. [Reported by badmin_perfmon]
Shown as second
ibm_spectrum_lsf.perfmon.jobs.sent_remote.avg
(gauge)
The average number of jobs sent to remote cluster in the sampling period. [Reported by badmin_perfmon]
Shown as job
ibm_spectrum_lsf.perfmon.jobs.sent_remote.current
(gauge)
The number of jobs sent to remote cluster in the sampling period. [Reported by badmin_perfmon]
Shown as job
ibm_spectrum_lsf.perfmon.jobs.sent_remote.max
(gauge)
The max number of jobs sent to remote cluster in the sampling period. [Reported by badmin_perfmon]
Shown as job
ibm_spectrum_lsf.perfmon.jobs.sent_remote.min
(gauge)
The min number of jobs sent to remote cluster in the sampling period. [Reported by badmin_perfmon]
Shown as job
ibm_spectrum_lsf.perfmon.jobs.sent_remote.total
(gauge)
The total number of jobs sent to remote cluster in the sampling period. [Reported by badmin_perfmon]
Shown as job
ibm_spectrum_lsf.perfmon.jobs.submission_requests.avg
(gauge)
The average number of job submission requests in the sampling period. [Reported by badmin_perfmon]
Shown as request
ibm_spectrum_lsf.perfmon.jobs.submission_requests.current
(gauge)
The number of job submission requests in the sampling period. [Reported by badmin_perfmon]
Shown as request
ibm_spectrum_lsf.perfmon.jobs.submission_requests.max
(gauge)
The max number of job submission requests in the sampling period. [Reported by badmin_perfmon]
Shown as request
ibm_spectrum_lsf.perfmon.jobs.submission_requests.min
(gauge)
The min number of job submission requests in the sampling period. [Reported by badmin_perfmon]
Shown as request
ibm_spectrum_lsf.perfmon.jobs.submission_requests.total
(gauge)
The total number of job submission requests in the sampling period. [Reported by badmin_perfmon]
Shown as request
ibm_spectrum_lsf.perfmon.jobs.submitted.avg
(gauge)
The average number of jobs submitted in the sampling period. [Reported by badmin_perfmon]
Shown as job
ibm_spectrum_lsf.perfmon.jobs.submitted.current
(gauge)
The number of jobs submitted in the sampling period. [Reported by badmin_perfmon]
Shown as job
ibm_spectrum_lsf.perfmon.jobs.submitted.max
(gauge)
The max number of jobs submitted in the sampling period. [Reported by badmin_perfmon]
Shown as job
ibm_spectrum_lsf.perfmon.jobs.submitted.min
(gauge)
The min number of jobs submitted in the sampling period. [Reported by badmin_perfmon]
Shown as job
ibm_spectrum_lsf.perfmon.jobs.submitted.total
(gauge)
The total number of jobs submitted in the sampling period. [Reported by badmin_perfmon]
Shown as job
ibm_spectrum_lsf.perfmon.mbatchd.processed_requests.avg
(gauge)
The average number of queries handled by mbatchd in the sampling period. [Reported by badmin_perfmon]
Shown as request
ibm_spectrum_lsf.perfmon.mbatchd.processed_requests.current
(gauge)
The number of queries handled by mbatchd in the sampling period. [Reported by badmin_perfmon]
Shown as request
ibm_spectrum_lsf.perfmon.mbatchd.processed_requests.max
(gauge)
The max number of queries handled by mbatchd in the sampling period. [Reported by badmin_perfmon]
Shown as request
ibm_spectrum_lsf.perfmon.mbatchd.processed_requests.min
(gauge)
The min number of queries handled by mbatchd in the sampling period. [Reported by badmin_perfmon]
Shown as request
ibm_spectrum_lsf.perfmon.mbatchd.processed_requests.total
(gauge)
The total number of queries handled by mbatchd in the sampling period. [Reported by badmin_perfmon]
Shown as request
ibm_spectrum_lsf.perfmon.memory.utilization.current
(gauge)
Current memory utilization. [Reported by badmin_perfmon]
ibm_spectrum_lsf.perfmon.memory.utilization.total
(gauge)
Total memory utilization. [Reported by badmin_perfmon]
ibm_spectrum_lsf.perfmon.queue.queries.avg
(gauge)
The average number of queue queries in the sampling period. [Reported by badmin_perfmon]
ibm_spectrum_lsf.perfmon.queue.queries.current
(gauge)
The number of queue queries in the sampling period. [Reported by badmin_perfmon]
ibm_spectrum_lsf.perfmon.queue.queries.max
(gauge)
The max number of queue queries in the sampling period. [Reported by badmin_perfmon]
ibm_spectrum_lsf.perfmon.queue.queries.min
(gauge)
The min number of queue queries in the sampling period. [Reported by badmin_perfmon]
ibm_spectrum_lsf.perfmon.queue.queries.total
(gauge)
The total number of queue queries in the sampling period. [Reported by badmin_perfmon]
ibm_spectrum_lsf.perfmon.scheduler.host_matches.avg
(gauge)
The average number of hosts matching the resource criteria for a job. [Reported by badmin_perfmon]
Shown as host
ibm_spectrum_lsf.perfmon.scheduler.host_matches.current
(gauge)
The number of hosts matching the resource criteria for a job. [Reported by badmin_perfmon]
Shown as host
ibm_spectrum_lsf.perfmon.scheduler.host_matches.max
(gauge)
The max number of hosts matching the resource criteria for a job. [Reported by badmin_perfmon]
Shown as host
ibm_spectrum_lsf.perfmon.scheduler.host_matches.min
(gauge)
The min number of hosts matching the resource criteria for a job. [Reported by badmin_perfmon]
Shown as host
ibm_spectrum_lsf.perfmon.scheduler.host_matches.total
(gauge)
The total number of hosts matching the resource criteria for a job in the sampling period. [Reported by badmin_perfmon]
Shown as host
ibm_spectrum_lsf.perfmon.slots.utilization.current
(gauge)
The current slot utilization. [Reported by badmin_perfmon]
ibm_spectrum_lsf.perfmon.slots.utilization.total
(gauge)
The total slot utilization of the sampling period. [Reported by badmin_perfmon]
ibm_spectrum_lsf.queue.is_active
(gauge)
Whether or not jobs in the queue can be started. [Reported by bqueues]
ibm_spectrum_lsf.queue.is_open
(gauge)
Whether or not the queue can accept jobs. [Reported by bqueues]
ibm_spectrum_lsf.queue.max_jobs
(gauge)
The maximum number of job slots that can be used by the jobs from the queue. These job slots are used by dispatched jobs that are not yet finished, and by pending jobs that reserve slots. [Reported by bqueues]
ibm_spectrum_lsf.queue.max_jobs_per_host
(gauge)
The maximum number of job slots a host can allocate from this queue. [Reported by bqueues]
ibm_spectrum_lsf.queue.max_jobs_per_processor
(gauge)
The maximum number of job slots a processor can process from the queue. [Reported by bqueues]
ibm_spectrum_lsf.queue.max_jobs_per_user
(gauge)
The maximum number of job slots each user can use for jobs in the queue. [Reported by bqueues]
ibm_spectrum_lsf.queue.num_job_slots
(gauge)
The total number of slots for jobs in the queue. [Reported by bqueues]
ibm_spectrum_lsf.queue.pending
(gauge)
The total number of tasks for all pending jobs in the queue. [Reported by bqueues]
Shown as job
ibm_spectrum_lsf.queue.priority
(gauge)
The priority of the queue. The larger the value, the higher the priority. [Reported by bqueues]
ibm_spectrum_lsf.queue.running
(gauge)
The total number of tasks for all running jobs in the queue. If the -alloc option is used, the total is allocated slots for the jobs in the queue. [Reported by bqueues]
Shown as task
ibm_spectrum_lsf.queue.suspended
(gauge)
The total number of tasks for all suspended jobs in the queue. [Reported by bqueues]
Shown as task
ibm_spectrum_lsf.server.gpu.num_gpus
(gauge)
The total number of GPUs. [Reported by bhosts_gpu]
ibm_spectrum_lsf.server.gpu.num_gpus_alloc
(gauge)
The current total number of GPUs that are allocated to be used by a job. [Reported by bhosts_gpu]
ibm_spectrum_lsf.server.gpu.num_gpus_exclusive_alloc
(gauge)
The current total number of GPUs that are allocated to be used exclusively by the job. [Reported by bhosts_gpu]
ibm_spectrum_lsf.server.gpu.num_gpus_exclusive_available
(gauge)
The current total number of GPUs that are available for exclusive use by a job. [Reported by bhosts_gpu]
ibm_spectrum_lsf.server.gpu.num_gpus_jexclusive_alloc
(gauge)
The total number of GPUs allocated exclusively for a job. [Reported by bhosts_gpu]
ibm_spectrum_lsf.server.gpu.num_gpus_shared_alloc
(gauge)
The total number of GPUs allocated but shared. [Reported by bhosts_gpu]
ibm_spectrum_lsf.server.gpu.num_gpus_shared_available
(gauge)
The current total number of GPUs that are available for concurrent use by multiple jobs. [Reported by bhosts_gpu]
ibm_spectrum_lsf.server.max_jobs
(gauge)
The maximum number of job slots available. A -1 indicates no limit. [Reported by bhosts]
Shown as job
ibm_spectrum_lsf.server.num_jobs
(gauge)
The number of tasks for all jobs that are dispatched to the host. The NJOBS value includes running, suspended, and chunk jobs. [Reported by bhosts]
Shown as task
ibm_spectrum_lsf.server.reserved
(gauge)
The number of tasks for all pending jobs with reserved slots on the host. [Reported by bhosts]
Shown as task
ibm_spectrum_lsf.server.running
(gauge)
The number of tasks for all running jobs on the host. [Reported by bhosts]
ibm_spectrum_lsf.server.slots_per_user
(gauge)
The maximum number of job slots that the host can process on a per user basis. A -1 indicates no limit. [Reported by bhosts]
ibm_spectrum_lsf.server.status
(gauge)
The status of the host and the sbatchd daemon. Batch jobs can be dispatched only to hosts with an ok status. 1 if ok, 0 otherwise. [Reported by bhosts]
ibm_spectrum_lsf.server.suspended
(gauge)
The number of tasks for all system suspended jobs on the host. [Reported by bhosts]
ibm_spectrum_lsf.server.user_suspended
(gauge)
The number of tasks for all user suspended jobs on the host. Jobs can be suspended by the user or by the LSF administrator. [Reported by bhosts]
ibm_spectrum_lsf.slots.backfill.available
(gauge)
The available slots for backfill jobs. [Reported by bslots]
ibm_spectrum_lsf.slots.runtime_limit
(gauge)
The runtime limit for the backfill slots. [Reported by bslots]
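
Each description above ends with the metric source that reports it, which determines the metric_sources entry you need to enable. As an illustrative sketch (the function name metrics_by_source and the "always" label are not part of the integration), the tag can be extracted to group metric names by source:

```python
import re
from collections import defaultdict

# Matches the trailing "[Reported by <source>]" tag used in the list above.
REPORTED_BY = re.compile(r"\[Reported by ([a-z_]+)\]")


def metrics_by_source(entries):
    """Group metric names by the source named in each description."""
    grouped = defaultdict(list)
    for name, description in entries:
        match = REPORTED_BY.search(description)
        # "always" is a label chosen here for "[Always reported]" entries.
        source = match.group(1) if match else "always"
        grouped[source].append(name)
    return dict(grouped)


entries = [
    ("ibm_spectrum_lsf.can_connect",
     "Whether or not the integration can run LSF commands [Always reported]"),
    ("ibm_spectrum_lsf.cluster.hosts",
     "The number of hosts in the cluster. [Reported by lsclusters]"),
    ("ibm_spectrum_lsf.gpu.ecc",
     "Number of ECC errors. [Reported by lsload_gpu]"),
]
print(metrics_by_source(entries))
```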

Events

The IBM Spectrum LSF integration does not include any events.

Service Checks

The IBM Spectrum LSF integration does not include any service checks.

Troubleshooting

Use the datadog-agent check command to view the metrics the integration is collecting, as well as debug logs from the check:

sudo -u dd-agent bash -c "source /usr/share/lsf/conf/profile.lsf && datadog-agent check ibm_spectrum_lsf -l debug"
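
The profile.lsf path in this command depends on where LSF is installed. As a minimal sketch, assuming you want to script the same command for a different path, the helper below builds the argument list; the function name build_check_command is hypothetical, and the sketch only constructs the command without running it.

```python
import shlex


def build_check_command(profile_path, log_level="debug"):
    """Build the troubleshooting command as an argument list."""
    # Quote the path so that spaces or shell metacharacters stay literal.
    inner = (
        f"source {shlex.quote(profile_path)} && "
        f"datadog-agent check ibm_spectrum_lsf -l {shlex.quote(log_level)}"
    )
    return ["sudo", "-u", "dd-agent", "bash", "-c", inner]


print(build_check_command("/usr/share/lsf/conf/profile.lsf"))
```

The returned list can be passed to subprocess.run on the host where the Agent is installed.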

Need help? Contact Datadog support.