Lustre

Supported OS Linux

Integration version 1.0.0

Overview

This check monitors Lustre through the Datadog Agent.

Lustre is a distributed file system commonly used in high-performance computing (HPC) environments. This integration provides comprehensive monitoring of Lustre cluster performance, health, and operations across all node types: clients, metadata servers (MDS), and object storage servers (OSS).

The Datadog Agent can collect the following data from Lustre clusters:

  • Device Health: Monitor the status and health of all Lustre devices and targets
  • Job Statistics: Track per-job I/O operations, latency, and throughput on MDS and OSS nodes
  • Network Statistics: Monitor LNET performance including local and peer network interface metrics
  • General Performance: Collect detailed statistics on file system operations, locks, and client activities
  • Changelog Events: Capture filesystem change events for audit and analysis (client nodes only)

Setup

Follow the instructions below to install and configure this check for an Agent running on a host.

Installation

The Lustre check is included in the Datadog Agent package. No additional installation is needed on your server.

Configuration

To configure the Agent check:

  1. Edit the lustre.d/conf.yaml file in the conf.d/ folder at the root of your Agent’s configuration directory to start collecting your Lustre performance data. See the sample lustre.d/conf.yaml for all available configuration options.

  2. Add the dd-agent user to the sudoers file so that it can run Lustre commands without a password. Edit the sudoers file with visudo and add the following line, replacing each /path/to with the absolute path of that binary on your host:

    dd-agent ALL=(ALL) NOPASSWD: /path/to/lctl, /path/to/lnetctl, /path/to/lfs
    

    Note: The Datadog Agent must have sufficient privileges to execute Lustre commands (lctl, lnetctl, lfs). This typically requires running the Agent as root or with appropriate sudo permissions.

  3. Restart the Agent.
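
The /path/to placeholders in step 2 depend on where the Lustre utilities are installed on each host. The following is a minimal sketch, not part of the integration, that resolves binaries on the current PATH and prints a matching sudoers line to paste into visudo. The function name build_sudoers_line is hypothetical, and the sketch assumes the utilities are on the PATH of the user running it (on many systems that is root only).

```shell
#!/bin/sh
# Hypothetical helper: print a sudoers entry for dd-agent that lists the
# absolute path of every binary named in the arguments.
# Returns 1 and prints a message on stderr if a binary is not on PATH.
build_sudoers_line() {
    paths=""
    for bin in "$@"; do
        # command -v prints the resolved path of an executable found on PATH
        p=$(command -v "$bin") || { echo "not found: $bin" >&2; return 1; }
        # Join paths with ", " as expected in a sudoers command list
        paths="${paths:+$paths, }$p"
    done
    printf 'dd-agent ALL=(ALL) NOPASSWD: %s\n' "$paths"
}
```

Calling build_sudoers_line lctl lnetctl lfs on a Lustre node prints one line in the format shown in step 2, with the placeholders filled in. Review the line before adding it with visudo.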

Logs

On client nodes, the Lustre integration can collect changelog events as structured logs. These logs contain:

  • operation_type: The type of filesystem operation
  • timestamp: When the operation occurred
  • flags: Operation flags
  • message: Detailed operation information

Important: Changelog users must be registered for changelogs to be collected. Use the lctl changelog_register command to register changelog users. Refer to the Lustre manual.

To collect Lustre changelogs:

  1. Enable logs in your datadog.yaml file:

    logs_enabled: true

  2. Uncomment and edit the logs configuration block in your lustre.d/conf.yaml file. For example:

    logs:
      - type: integration
        source: lustre
        service: lustre

  3. Enable changelog collection in the lustre.d/conf.yaml file:

    enable_changelogs: true

Validation

Run the Agent’s status subcommand and look for lustre under the Checks section.
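
As a rough sketch of that check, the status output can be filtered for the lustre entry instead of being read by eye. The helper name status_has_check is hypothetical, and the pattern assumes each running check appears on its own indented line beginning with the check name, which may vary between Agent versions.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the status text on stdin has a line that
# starts (after optional indentation) with the given check name.
status_has_check() {
    grep -Eq "^[[:space:]]*$1([[:space:]]|\$)"
}

# Only query the Agent when its binary is available on this host.
if command -v datadog-agent >/dev/null 2>&1; then
    if sudo datadog-agent status | status_has_check lustre; then
        echo "lustre check is listed"
    else
        echo "lustre check not found in status output" >&2
    fi
fi
```

If the check is not listed, confirm that lustre.d/conf.yaml exists and that the Agent was restarted after it was edited.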

Uninstallation

To uninstall this integration from your Agent, run the following command:

datadog-agent integration remove datadog-lustre

Alternatively, to disable the integration, rename the lustre.d/conf.yaml file to lustre.d/conf.yaml.example.

Support

Need help? Contact Datadog Support.

Troubleshooting

Permissions

The Lustre integration requires elevated privileges to run Lustre commands. Ensure the Datadog Agent is running with appropriate permissions:

# Check if the Agent user can run Lustre commands
sudo -u dd-agent lctl dl
sudo -u dd-agent sudo lnetctl net show

Node type detection

If the integration cannot automatically detect the node type, specify it explicitly in the configuration:

instances:
  - node_type: client  # or 'mds' or 'oss'

Missing metrics

If expected metrics are not appearing:

  1. Verify the Lustre services are running and accessible.
  2. Check that the specified filesystem names match actual filesystems.
  3. Ensure the Agent has permission to read Lustre parameters.
  4. Enable debug logging to see detailed error messages.

Changelog registration

For changelog collection on client nodes, ensure changelog users are registered:

# Register a changelog user
lctl changelog_register

# List registered changelog users  
lctl changelog_users <filesystem>