HDFS DataNode Integration

HDFS Dashboard

Overview

Track disk utilization and failed volumes on each of your HDFS DataNodes. This Agent check collects metrics for both, as well as block- and cache-related metrics.

Use this check (hdfs_datanode) and its counterpart (hdfs_namenode), not the older two-in-one hdfs check, which is deprecated.

Setup

Installation

The HDFS DataNode check is included in the Datadog Agent package, so you don’t need to install anything else on your DataNodes.

Configuration

Prepare the DataNode

The Agent collects metrics from the DataNode’s JMX remote interface. The interface is disabled by default, so enable it by setting the following option in hadoop-env.sh (usually found in $HADOOP_HOME/conf):

export HADOOP_DATANODE_OPTS="-Dcom.sun.management.jmxremote
  -Dcom.sun.management.jmxremote.authenticate=false
  -Dcom.sun.management.jmxremote.ssl=false
  -Dcom.sun.management.jmxremote.port=50075 $HADOOP_DATANODE_OPTS"

Restart the DataNode process to enable the JMX interface.
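Hadoop's HTTP server also exposes the DataNode's MBeans as JSON at the /jmx path, which is what the Agent check reads via hdfs_datanode_jmx_uri. A quick way to confirm the endpoint responds after the restart (assuming the default port 50075 from the snippet above; adjust if yours differs):

```shell
# Query the DataNode's JMX-over-HTTP endpoint; a healthy DataNode returns
# a JSON document listing its MBeans. Fails fast if nothing is listening.
curl --silent --fail --connect-timeout 5 http://localhost:50075/jmx \
  || echo "DataNode JMX endpoint not reachable"
```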

Connect the Agent

Edit the hdfs_datanode.d/conf.yaml file in the conf.d/ folder at the root of your Agent’s configuration directory. See the sample hdfs_datanode.d/conf.yaml for all available configuration options:

init_config:

instances:
  - hdfs_datanode_jmx_uri: http://localhost:50075

Restart the Agent to begin sending DataNode metrics to Datadog.
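The instances list can hold more than one entry, and each entry accepts the Agent's standard tags option, so metrics can be sliced per node or environment in Datadog. A hypothetical sketch monitoring two remote DataNodes (hostnames are placeholders):

```yaml
init_config:

instances:
  - hdfs_datanode_jmx_uri: http://datanode-1.example.com:50075
    tags:
      - "env:prod"
  - hdfs_datanode_jmx_uri: http://datanode-2.example.com:50075
    tags:
      - "env:prod"
```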

Validation

Run the Agent’s status subcommand and look for hdfs_datanode under the Checks section.
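On Agent v6 and later the subcommand looks like the following; the exact invocation (and whether sudo is needed) varies by platform and Agent version:

```shell
# Print the Agent's check status and show the lines around the
# hdfs_datanode entry, which reports runs, metric counts, and errors.
sudo datadog-agent status | grep -A 4 hdfs_datanode
```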

Data Collected

Metrics

hdfs.datanode.dfs_remaining
(gauge)
The remaining disk space left in bytes
shown as byte
hdfs.datanode.dfs_capacity
(gauge)
Disk capacity in bytes
shown as byte
hdfs.datanode.dfs_used
(gauge)
Disk usage in bytes
shown as byte
hdfs.datanode.cache_capacity
(gauge)
Cache capacity in bytes
shown as byte
hdfs.datanode.cache_used
(gauge)
Cache used in bytes
shown as byte
hdfs.datanode.num_failed_volumes
(gauge)
Number of failed volumes
hdfs.datanode.last_volume_failure_date
(gauge)
The date/time of the last volume failure in milliseconds since epoch
shown as millisecond
hdfs.datanode.estimated_capacity_lost_total
(gauge)
The estimated capacity lost in bytes
shown as byte
hdfs.datanode.num_blocks_cached
(gauge)
The number of blocks cached
shown as block
hdfs.datanode.num_blocks_failed_to_cache
(gauge)
The number of blocks that failed to cache
shown as block
hdfs.datanode.num_blocks_failed_to_uncache
(gauge)
The number of blocks that failed to be removed from the cache
shown as block
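As a usage example, hdfs.datanode.dfs_used and hdfs.datanode.dfs_capacity combine naturally into a utilization percentage — the same arithmetic you might put in a Datadog monitor formula. A minimal sketch (the function name is illustrative, not part of the check):

```python
def disk_utilization_pct(dfs_used: float, dfs_capacity: float) -> float:
    """Percent of a DataNode's disk capacity in use (both inputs in bytes)."""
    if dfs_capacity <= 0:
        raise ValueError("dfs_capacity must be positive")
    return 100.0 * dfs_used / dfs_capacity

# e.g. 750 GB used out of 1 TB of capacity
print(disk_utilization_pct(750 * 10**9, 10**12))  # 75.0
```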

Events

The HDFS DataNode check does not include any events at this time.

Service Checks

hdfs.datanode.jmx.can_connect:

Returns Critical if the Agent cannot connect to the DataNode’s JMX interface for any reason (e.g. wrong port provided, timeout, unparseable JSON response).

Troubleshooting

Need help? Contact Datadog Support.

Further Reading

HDFS NameNode Integration

HDFS Dashboard

Overview

Monitor your primary and standby HDFS NameNodes to know when your cluster falls into a precarious state: when you’re down to one NameNode remaining, or when it’s time to add more capacity to the cluster. This Agent check collects metrics for remaining capacity, corrupt/missing blocks, dead DataNodes, filesystem load, under-replicated blocks, total volume failures (across all DataNodes), and many more.

Use this check (hdfs_namenode) and its counterpart (hdfs_datanode), not the older two-in-one hdfs check, which is deprecated.

Setup

Installation

The HDFS NameNode check is included in the Datadog Agent package, so you don’t need to install anything else on your NameNodes.

Configuration

Prepare the NameNode

The Agent collects metrics from the NameNode’s JMX remote interface. The interface is disabled by default, so enable it by setting the following option in hadoop-env.sh (usually found in $HADOOP_HOME/conf):

export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote
  -Dcom.sun.management.jmxremote.authenticate=false
  -Dcom.sun.management.jmxremote.ssl=false
  -Dcom.sun.management.jmxremote.port=50070 $HADOOP_NAMENODE_OPTS"

Restart the NameNode process to enable the JMX interface.
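Like the DataNode, the NameNode publishes its metrics as JSON at the /jmx path on its HTTP port. A hedged sketch of how a client might pull capacity figures out of such a response — the bean and attribute names (Hadoop:service=NameNode,name=FSNamesystem, CapacityRemaining, and so on) follow the usual Hadoop naming, but verify them against your Hadoop version; the payload below is abbreviated and illustrative:

```python
import json

# Abbreviated, illustrative /jmx payload; a real NameNode returns many beans.
sample_response = json.dumps({
    "beans": [
        {
            "name": "Hadoop:service=NameNode,name=FSNamesystem",
            "CapacityTotal": 10**12,
            "CapacityUsed": 4 * 10**11,
            "CapacityRemaining": 6 * 10**11,
        }
    ]
})

def find_bean(payload: str, bean_name: str) -> dict:
    """Return the first JMX bean in the JSON payload matching bean_name."""
    for bean in json.loads(payload).get("beans", []):
        if bean.get("name") == bean_name:
            return bean
    raise KeyError(bean_name)

fs = find_bean(sample_response, "Hadoop:service=NameNode,name=FSNamesystem")
print(fs["CapacityRemaining"])  # 600000000000
```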

Connect the Agent

Edit the hdfs_namenode.d/conf.yaml file in the conf.d/ folder at the root of your Agent’s configuration directory. See the sample hdfs_namenode.d/conf.yaml for all available configuration options:

init_config:

instances:
  - hdfs_namenode_jmx_uri: http://localhost:50070

Restart the Agent to begin sending NameNode metrics to Datadog.

Validation

Run the Agent’s status subcommand and look for hdfs_namenode under the Checks section.

Data Collected

Metrics

hdfs.namenode.capacity_total
(gauge)
Total disk capacity in bytes
shown as byte
hdfs.namenode.capacity_used
(gauge)
Disk usage in bytes
shown as byte
hdfs.namenode.capacity_remaining
(gauge)
Remaining disk space left in bytes
shown as byte
hdfs.namenode.total_load
(gauge)
Total load on the file system
hdfs.namenode.fs_lock_queue_length
(gauge)
Lock queue length
hdfs.namenode.blocks_total
(gauge)
Total number of blocks
shown as block
hdfs.namenode.max_objects
(gauge)
Maximum number of files HDFS supports
shown as object
hdfs.namenode.files_total
(gauge)
Total number of files
shown as file
hdfs.namenode.pending_replication_blocks
(gauge)
Number of blocks pending replication
shown as block
hdfs.namenode.under_replicated_blocks
(gauge)
Number of under-replicated blocks
shown as block
hdfs.namenode.scheduled_replication_blocks
(gauge)
Number of blocks scheduled for replication
shown as block
hdfs.namenode.pending_deletion_blocks
(gauge)
Number of pending deletion blocks
shown as block
hdfs.namenode.num_live_data_nodes
(gauge)
Total number of live data nodes
shown as node
hdfs.namenode.num_dead_data_nodes
(gauge)
Total number of dead data nodes
shown as node
hdfs.namenode.num_decom_live_data_nodes
(gauge)
Number of decommissioning live data nodes
shown as node
hdfs.namenode.num_decom_dead_data_nodes
(gauge)
Number of decommissioning dead data nodes
shown as node
hdfs.namenode.volume_failures_total
(gauge)
Total volume failures
hdfs.namenode.estimated_capacity_lost_total
(gauge)
Estimated capacity lost in bytes
shown as byte
hdfs.namenode.num_decommissioning_data_nodes
(gauge)
Number of decommissioning data nodes
shown as node
hdfs.namenode.num_stale_data_nodes
(gauge)
Number of stale data nodes
shown as node
hdfs.namenode.num_stale_storages
(gauge)
Number of stale storages
hdfs.namenode.missing_blocks
(gauge)
Number of missing blocks
shown as block
hdfs.namenode.corrupt_blocks
(gauge)
Number of corrupt blocks
shown as block
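Several of the block metrics above answer one question: is any data unreadable or merely at risk? A small illustrative helper (the names are mine, not the check's) that turns the corrupt/missing/under-replicated counts into a status, mirroring how you might structure a multi-alert monitor:

```python
def block_health(missing: int, corrupt: int, under_replicated: int) -> str:
    """Summarize block-level health from NameNode metrics.

    Missing or corrupt blocks mean data is unreadable or damaged (critical);
    under-replicated blocks mean data is intact but at risk (warning).
    """
    if missing > 0 or corrupt > 0:
        return "critical"
    if under_replicated > 0:
        return "warning"
    return "ok"

print(block_health(missing=0, corrupt=0, under_replicated=12))  # warning
```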

Events

The HDFS NameNode check does not include any events at this time.

Service Checks

hdfs.namenode.jmx.can_connect:

Returns Critical if the Agent cannot connect to the NameNode’s JMX interface for any reason (e.g. wrong port provided, timeout, unparseable JSON response).

Troubleshooting

Need help? Contact Datadog Support.
