Datadog-Hadoop HDFS DataNode Integration

Overview

Track disk utilization and failed volumes on each of your HDFS DataNodes. This Agent check collects metrics for both, as well as block- and cache-related metrics.

Use this check (hdfs_datanode) and its counterpart check (hdfs_namenode) rather than the older two-in-one check (hdfs), which is deprecated.

Setup

Installation

The HDFS DataNode check is packaged with the Agent, so simply install the Agent on your DataNodes.

Configuration

Prepare the DataNode

The Agent collects metrics from the DataNode’s JMX remote interface. The interface is disabled by default, so enable it by setting the following option in hadoop-env.sh (usually found in $HADOOP_HOME/conf):

export HADOOP_DATANODE_OPTS="-Dcom.sun.management.jmxremote
  -Dcom.sun.management.jmxremote.authenticate=false
  -Dcom.sun.management.jmxremote.ssl=false
  -Dcom.sun.management.jmxremote.port=50075 $HADOOP_DATANODE_OPTS"

Restart the DataNode process to enable the JMX interface.
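The Agent reads these metrics over HTTP from the DataNode's /jmx endpoint (on the same port configured above). As a rough illustration of what the Agent consumes, the sketch below parses a sample /jmx-style payload; the bean and attribute names are assumptions modeled on typical Hadoop JMX output and may differ in your Hadoop version.

```python
import json

# Illustrative sample of the JSON a DataNode's /jmx endpoint returns.
# Bean and attribute names here are assumptions, not verified against
# any particular Hadoop release; values are made up.
sample_jmx_response = json.dumps({
    "beans": [
        {
            "name": "Hadoop:service=DataNode,name=FSDatasetState",
            "Capacity": 41083600896,
            "DfsUsed": 501932032,
            "Remaining": 27878170624,
            "NumFailedVolumes": 0,
        }
    ]
})

def extract_fs_dataset_state(payload):
    """Return the FSDatasetState bean, which carries the disk metrics."""
    for bean in json.loads(payload)["beans"]:
        if "FSDatasetState" in bean.get("name", ""):
            return bean
    return None

bean = extract_fs_dataset_state(sample_jmx_response)
print(bean["Remaining"])  # bytes still free on the DataNode's volumes
```

Fetching http://localhost:50075/jmx with a browser or curl after the restart is a quick way to confirm the endpoint is reachable before pointing the Agent at it.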

Connect the Agent

Create a file hdfs_datanode.yaml in the Agent’s conf.d directory. See the sample hdfs_datanode.yaml for all available configuration options:

init_config:

instances:
  - hdfs_datanode_jmx_uri: http://localhost:50075

Restart the Agent to begin sending DataNode metrics to Datadog.
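As with most Agent checks, each instance can also carry custom tags that are attached to every metric the check emits. A minimal sketch, assuming the standard per-instance tags option is supported (the tag values are hypothetical):

```yaml
init_config:

instances:
  - hdfs_datanode_jmx_uri: http://localhost:50075
    tags:
      - cluster:primary    # hypothetical tag
      - env:production     # hypothetical tag
```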

Validation

Run the Agent’s info subcommand and look for hdfs_datanode under the Checks section:

  Checks
  ======
    [...]

    hdfs_datanode
    -------------
      - instance #0 [OK]
      - Collected 26 metrics, 0 events & 1 service check

    [...]

Compatibility

The hdfs_datanode check is compatible with all major platforms.

Data Collected

Metrics

hdfs.datanode.dfs_remaining
(gauge)
The remaining disk space left in bytes
shown as byte
hdfs.datanode.dfs_capacity
(gauge)
Disk capacity in bytes
shown as byte
hdfs.datanode.dfs_used
(gauge)
Disk usage in bytes
shown as byte
hdfs.datanode.cache_capacity
(gauge)
Cache capacity in bytes
shown as byte
hdfs.datanode.cache_used
(gauge)
Cache used in bytes
shown as byte
hdfs.datanode.num_failed_volumes
(gauge)
Number of failed volumes
hdfs.datanode.last_volume_failure_date
(gauge)
The date/time of the last volume failure in milliseconds since epoch
shown as millisecond
hdfs.datanode.estimated_capacity_lost_total
(gauge)
The estimated capacity lost in bytes
shown as byte
hdfs.datanode.num_blocks_cached
(gauge)
The number of blocks cached
shown as block
hdfs.datanode.num_blocks_failed_to_cache
(gauge)
The number of blocks that failed to cache
shown as block
hdfs.datanode.num_blocks_failed_to_uncache
(gauge)
The number of blocks that failed to be removed from the cache
shown as block
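The disk metrics above combine naturally into a utilization figure for dashboards or monitors: percent used is hdfs.datanode.dfs_used divided by hdfs.datanode.dfs_capacity. A minimal sketch of that arithmetic (the byte values are illustrative, not from a real node):

```python
def disk_utilization_pct(dfs_used, dfs_capacity):
    """Percent of DataNode disk capacity in use.

    Mirrors hdfs.datanode.dfs_used / hdfs.datanode.dfs_capacity,
    both reported in bytes. Guards against a zero capacity reading.
    """
    if dfs_capacity <= 0:
        return 0.0
    return 100.0 * dfs_used / dfs_capacity

# Illustrative values, in bytes
print(round(disk_utilization_pct(501932032, 41083600896), 2))
```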

Events

The HDFS DataNode check does not include any events at this time.

Service Checks

hdfs.datanode.jmx.can_connect:

Returns CRITICAL if the Agent cannot connect to the DataNode's JMX interface for any reason (e.g. wrong port provided, timeout, unparseable JSON response). Returns OK otherwise.

Troubleshooting

Need help? Contact Datadog Support.

Further Reading