Datadog-Hadoop HDFS NameNode Integration

Overview

Monitor your primary and standby HDFS NameNodes to know when your cluster falls into a precarious state: when you’re down to one NameNode remaining, or when it’s time to add more capacity to the cluster. This Agent check collects metrics for remaining capacity, corrupt/missing blocks, dead DataNodes, filesystem load, under-replicated blocks, total volume failures (across all DataNodes), and many more.

Use this check (hdfs_namenode) and its counterpart check (hdfs_datanode) rather than the older two-in-one check (hdfs), which is deprecated.

Setup

Installation

The HDFS NameNode check is packaged with the Agent, so simply install the Agent on your NameNodes.

Configuration

Prepare the NameNode

The Agent collects metrics from the NameNode’s JMX remote interface. The interface is disabled by default, so enable it by setting the following option in hadoop-env.sh (usually found in $HADOOP_HOME/conf):

export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote
  -Dcom.sun.management.jmxremote.authenticate=false
  -Dcom.sun.management.jmxremote.ssl=false
  -Dcom.sun.management.jmxremote.port=50070 $HADOOP_NAMENODE_OPTS"

Restart the NameNode process to enable the JMX interface.
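
For reference, the restart and a quick sanity check might look like the following on a plain Apache Hadoop 2.x layout; the script path, port, and JMX query below are assumptions, so adjust them to your deployment:

# Restart the NameNode (daemon script location varies by distribution)
$HADOOP_HOME/sbin/hadoop-daemon.sh stop namenode
$HADOOP_HOME/sbin/hadoop-daemon.sh start namenode

# Confirm the JMX metrics endpoint responds (assumes the default NameNode HTTP port 50070)
curl -s 'http://localhost:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState'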

Connect the Agent

Create a file hdfs_namenode.yaml in the Agent’s conf.d directory. See the sample hdfs_namenode.yaml for all available configuration options:

init_config:

instances:
  - hdfs_namenode_jmx_uri: http://localhost:50070
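
If you monitor more than one cluster, you can also attach instance-level tags, a standard option for Agent checks; the tag name and value below are illustrative:

instances:
  - hdfs_namenode_jmx_uri: http://localhost:50070
    tags:
      - cluster:hadoop-prod   # illustrative tag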

Restart the Agent to begin sending NameNode metrics to Datadog.
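
On most Linux hosts running Agent v5, the restart is something like the following; the exact invocation is an assumption and depends on your platform and init system:

sudo /etc/init.d/datadog-agent restart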

Validation

Run the Agent's info subcommand and look for hdfs_namenode under the Checks section.
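
The subcommand is typically run as shown below, again assuming a Linux host with Agent v5; adjust for your platform:

sudo /etc/init.d/datadog-agent info

The output should include a section like: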

  Checks
  ======
    [...]

    hdfs_namenode
    -------------
      - instance #0 [OK]
      - Collected 26 metrics, 0 events & 1 service check

    [...]

Compatibility

The hdfs_namenode check is compatible with all major platforms.

Data Collected

Metrics

hdfs.namenode.capacity_total (gauge): Total disk capacity in bytes (shown as byte)
hdfs.namenode.capacity_used (gauge): Disk usage in bytes (shown as byte)
hdfs.namenode.capacity_remaining (gauge): Remaining disk space in bytes (shown as byte)
hdfs.namenode.total_load (gauge): Total load on the file system
hdfs.namenode.fs_lock_queue_length (gauge): Lock queue length
hdfs.namenode.blocks_total (gauge): Total number of blocks (shown as block)
hdfs.namenode.max_objects (gauge): Maximum number of files HDFS supports (shown as object)
hdfs.namenode.files_total (gauge): Total number of files (shown as file)
hdfs.namenode.pending_replication_blocks (gauge): Number of blocks pending replication (shown as block)
hdfs.namenode.under_replicated_blocks (gauge): Number of under-replicated blocks (shown as block)
hdfs.namenode.scheduled_replication_blocks (gauge): Number of blocks scheduled for replication (shown as block)
hdfs.namenode.pending_deletion_blocks (gauge): Number of blocks pending deletion (shown as block)
hdfs.namenode.num_live_data_nodes (gauge): Total number of live DataNodes (shown as node)
hdfs.namenode.num_dead_data_nodes (gauge): Total number of dead DataNodes (shown as node)
hdfs.namenode.num_decom_live_data_nodes (gauge): Number of decommissioning live DataNodes (shown as node)
hdfs.namenode.num_decom_dead_data_nodes (gauge): Number of decommissioning dead DataNodes (shown as node)
hdfs.namenode.volume_failures_total (gauge): Total volume failures across all DataNodes
hdfs.namenode.estimated_capacity_lost_total (gauge): Estimated capacity lost in bytes (shown as byte)
hdfs.namenode.num_decommissioning_data_nodes (gauge): Number of decommissioning DataNodes (shown as node)
hdfs.namenode.num_stale_data_nodes (gauge): Number of stale DataNodes (shown as node)
hdfs.namenode.num_stale_storages (gauge): Number of stale storages
hdfs.namenode.missing_blocks (gauge): Number of missing blocks (shown as block)
hdfs.namenode.corrupt_blocks (gauge): Number of corrupt blocks (shown as block)

Events

The HDFS NameNode check does not include any events at this time.

Service Checks

hdfs.namenode.jmx.can_connect:

Returns CRITICAL if the Agent cannot connect to the NameNode's JMX interface for any reason, for example a wrong port, a connection timeout, or an unparseable JSON response. Returns OK otherwise.

Troubleshooting

Need help? Contact Datadog Support.

Further Reading