Datadog-Hadoop YARN Integration

Hadoop YARN

Overview

This check collects metrics from your YARN ResourceManager, including:

  • Cluster-wide metrics: number of running apps, running containers, unhealthy nodes, etc.
  • Per-application metrics: app progress, elapsed running time, running containers, memory use, etc.
  • Node metrics: available vCores, time of last health update, etc.

And more.
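
For orientation, the three metric groups above line up with resources exposed by the ResourceManager REST API. The sketch below builds those URLs from a base address. The `ENDPOINTS` mapping and the `endpoint_urls` helper are illustrative names, and the assumption that the check reads exactly these paths is not confirmed by this page.

```python
from urllib.parse import urljoin

# ResourceManager REST API paths, as documented by Apache Hadoop.
# Assumption: each path corresponds to one metric group listed above.
ENDPOINTS = {
    "cluster": "ws/v1/cluster/metrics",
    "apps": "ws/v1/cluster/apps",
    "nodes": "ws/v1/cluster/nodes",
}


def endpoint_urls(resourcemanager_uri):
    """Return a dict mapping metric group -> full REST URL."""
    # Ensure exactly one trailing slash so urljoin appends instead of replacing.
    base = resourcemanager_uri.rstrip("/") + "/"
    return {group: urljoin(base, path) for group, path in ENDPOINTS.items()}


if __name__ == "__main__":
    for group, url in endpoint_urls("http://localhost:8088").items():
        print(group, url)
```

Opening one of these URLs in a browser or with curl is a quick way to see the raw values behind the metrics.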

Setup

Installation

The YARN check is packaged with the Agent, so simply install the Agent on the host that runs your YARN ResourceManager. If you need the newest version of the check, install the dd-check-yarn package.

Configuration

Create a file named yarn.yaml in the Agent’s conf.d directory. See the sample yarn.yaml for all available configuration options:

init_config:

instances:
  - resourcemanager_uri: http://localhost:8088 # or wherever your ResourceManager listens
    cluster_name: MyCluster # used to tag metrics, e.g. 'cluster_name:MyCluster'; default is 'default_cluster'
    collect_app_metrics: true

See the example check configuration for a comprehensive list and description of all check options.
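
As a pre-flight sanity check, an instance entry can be linted before restarting the Agent. The following is a minimal sketch, not Agent code: `normalize_instance` and `cluster_tag` are hypothetical helpers, treating `resourcemanager_uri` as required is a choice of the sketch, and the `collect_app_metrics` fallback is an assumption. Only the `default_cluster` default comes from the sample above.

```python
DEFAULT_CLUSTER_NAME = "default_cluster"  # default stated in the sample config


def normalize_instance(instance):
    """Return a copy of one `instances` entry with defaults filled in."""
    uri = instance.get("resourcemanager_uri")
    # Sketch-level rule: insist on an explicit URI with a scheme.
    if not uri:
        raise ValueError("resourcemanager_uri is missing")
    if not uri.startswith(("http://", "https://")):
        raise ValueError("resourcemanager_uri needs a scheme: %r" % uri)
    return {
        "resourcemanager_uri": uri.rstrip("/"),
        "cluster_name": instance.get("cluster_name") or DEFAULT_CLUSTER_NAME,
        # Assumption: a missing collect_app_metrics is treated as disabled.
        "collect_app_metrics": bool(instance.get("collect_app_metrics", False)),
    }


def cluster_tag(instance):
    """Build the tag described in the sample, e.g. 'cluster_name:MyCluster'."""
    return "cluster_name:%s" % instance["cluster_name"]


if __name__ == "__main__":
    cfg = normalize_instance({
        "resourcemanager_uri": "http://localhost:8088",
        "cluster_name": "MyCluster",
        "collect_app_metrics": True,
    })
    print(cfg, cluster_tag(cfg))
```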

Restart the Agent to start sending YARN metrics to Datadog.

Validation

Run the Agent’s info subcommand and look for yarn under the Checks section:

  Checks
  ======
    [...]

    yarn
    -------
      - instance #0 [OK]
      - Collected 26 metrics, 0 events & 1 service check

    [...]
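
If this validation step needs to be scripted, the status can be pulled out of the captured text. The sketch below assumes the output looks like the sample above (check name on its own line, then `- instance #N [STATUS]` lines, then a blank line); other Agent versions may format it differently. `check_statuses` is an illustrative helper, not part of the Agent.

```python
import re

# Matches lines such as "- instance #0 [OK]" and captures the status word.
INSTANCE_RE = re.compile(r"^- instance #(\d+) \[(\w+)\]")


def check_statuses(info_output, check_name="yarn"):
    """Return the list of instance statuses for a check, or None if absent."""
    lines = [line.strip() for line in info_output.splitlines()]
    try:
        start = lines.index(check_name)
    except ValueError:
        return None  # the check is not listed at all
    statuses = []
    for text in lines[start + 1:]:
        if not text:
            break  # a blank line ends this check's block
        match = INSTANCE_RE.match(text)
        if match:
            statuses.append(match.group(2))
    return statuses


SAMPLE = """\
  Checks
  ======
    [...]

    yarn
    -------
      - instance #0 [OK]
      - Collected 26 metrics, 0 events & 1 service check

    [...]
"""

if __name__ == "__main__":
    print(check_statuses(SAMPLE))
```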

Compatibility

The YARN check is compatible with all major platforms.

Data Collected

Metrics

yarn.metrics.apps_submitted
(gauge)
The number of submitted apps
shown as task
yarn.metrics.apps_completed
(gauge)
The number of completed apps
shown as task
yarn.metrics.apps_pending
(gauge)
The number of pending apps
shown as task
yarn.metrics.apps_running
(gauge)
The number of running apps
shown as task
yarn.metrics.apps_failed
(gauge)
The number of failed apps
shown as task
yarn.metrics.apps_killed
(gauge)
The number of killed apps
shown as task
yarn.metrics.reserved_mb
(gauge)
The size of reserved memory
shown as mebibyte
yarn.metrics.available_mb
(gauge)
The amount of available memory
shown as mebibyte
yarn.metrics.allocated_mb
(gauge)
The amount of allocated memory
shown as mebibyte
yarn.metrics.total_mb
(gauge)
The amount of total memory
shown as mebibyte
yarn.metrics.reserved_virtual_cores
(gauge)
The number of reserved virtual cores
shown as core
yarn.metrics.available_virtual_cores
(gauge)
The number of available virtual cores
shown as core
yarn.metrics.allocated_virtual_cores
(gauge)
The number of allocated virtual cores
shown as core
yarn.metrics.total_virtual_cores
(gauge)
The total number of virtual cores
shown as core
yarn.metrics.containers_allocated
(gauge)
The number of containers allocated
yarn.metrics.containers_reserved
(gauge)
The number of containers reserved
yarn.metrics.containers_pending
(gauge)
The number of containers pending
yarn.metrics.total_nodes
(gauge)
The total number of nodes
shown as node
yarn.metrics.active_nodes
(gauge)
The number of active nodes
shown as node
yarn.metrics.lost_nodes
(gauge)
The number of lost nodes
shown as node
yarn.metrics.unhealthy_nodes
(gauge)
The number of unhealthy nodes
shown as node
yarn.metrics.decommissioned_nodes
(gauge)
The number of decommissioned nodes
shown as node
yarn.metrics.rebooted_nodes
(gauge)
The number of rebooted nodes
shown as node
yarn.apps.progress
(rate)
The progress of the application as a percent
shown as percent
yarn.apps.started_time
(rate)
The time at which the application started (in ms since epoch)
shown as second
yarn.apps.finished_time
(rate)
The time at which the application finished (in ms since epoch)
shown as second
yarn.apps.elapsed_time
(rate)
The elapsed time since the application started (in ms)
shown as second
yarn.apps.allocated_mb
(rate)
The sum of memory in MB allocated to the application's running containers
shown as mebibyte
yarn.apps.allocated_vcores
(rate)
The sum of virtual cores allocated to the application's running containers
shown as core
yarn.apps.running_containers
(rate)
The number of containers currently running for the application
yarn.apps.memory_seconds
(rate)
The amount of memory the application has allocated (megabyte-seconds)
shown as second
yarn.apps.vcore_seconds
(rate)
The amount of CPU resources the application has allocated (virtual core-seconds)
shown as second
yarn.node.last_health_update
(gauge)
The last time the node reported its health (in ms since epoch)
shown as millisecond
yarn.node.used_memory_mb
(gauge)
The total amount of memory currently used on the node (in MB)
shown as mebibyte
yarn.node.avail_memory_mb
(gauge)
The total amount of memory currently available on the node (in MB)
shown as mebibyte
yarn.node.used_virtual_cores
(gauge)
The total number of vCores currently used on the node
shown as core
yarn.node.available_virtual_cores
(gauge)
The total number of vCores available on the node
shown as core
yarn.node.num_containers
(gauge)
The total number of containers currently running on the node
yarn.queue.root.maxCapacity
(gauge)
The configured maximum queue capacity in percentage for root queue
shown as percentage
yarn.queue.root.usedCapacity
(gauge)
The used queue capacity in percentage for root queue
shown as percentage
yarn.queue.root.capacity
(gauge)
The configured queue capacity in percentage for root queue
shown as percentage
yarn.queue.numPendingApplications
(gauge)
The number of pending applications in this queue
shown as task
yarn.queue.userAMResourceLimit.memory
(gauge)
The maximum memory resources a user can use for Application Masters (in MB)
shown as mebibyte
yarn.queue.userAMResourceLimit.vCores
(gauge)
The maximum number of vCores a user can use for Application Masters
shown as core
yarn.queue.absoluteCapacity
(gauge)
The absolute capacity percentage this queue can use of the entire cluster
shown as percentage
yarn.queue.userLimitFactor
(gauge)
The user limit factor set in the configuration
yarn.queue.userLimit
(gauge)
The minimum user limit percent set in the configuration
yarn.queue.numApplications
(gauge)
The number of applications currently in the queue
shown as task
yarn.queue.usedAMResource.memory
(gauge)
The memory resources used for Application Masters (in MB)
shown as mebibyte
yarn.queue.usedAMResource.vCores
(gauge)
The number of vCores used for Application Masters
shown as core
yarn.queue.absoluteUsedCapacity
(gauge)
The absolute used capacity percentage this queue is using of the entire cluster
shown as percentage
yarn.queue.resourcesUsed.memory
(gauge)
The total memory resources this queue is using (in MB)
shown as mebibyte
yarn.queue.resourcesUsed.vCores
(gauge)
The total number of vCores this queue is using
shown as core
yarn.queue.AMResourceLimit.vCores
(gauge)
The maximum number of vCores this queue can use for Application Masters
shown as core
yarn.queue.AMResourceLimit.memory
(gauge)
The maximum memory resources this queue can use for Application Masters (in MB)
shown as mebibyte
yarn.queue.capacity
(gauge)
The configured queue capacity in percentage relative to its parent queue
shown as percentage
yarn.queue.numActiveApplications
(gauge)
The number of active applications in this queue
shown as task
yarn.queue.absoluteMaxCapacity
(gauge)
The absolute maximum capacity percentage this queue can use of the entire cluster
shown as percentage
yarn.queue.usedCapacity
(gauge)
The used queue capacity in percentage
shown as percentage
yarn.queue.numContainers
(gauge)
The number of containers being used
yarn.queue.maxCapacity
(gauge)
The configured maximum queue capacity in percentage relative to its parent queue
shown as percentage
yarn.queue.maxApplications
(gauge)
The maximum number of applications this queue can have
shown as task
yarn.queue.maxApplicationsPerUser
(gauge)
The maximum number of active applications per user this queue can have
shown as task
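
The `yarn.metrics.*` names above read like snake_case forms of the fields in the ResourceManager's `clusterMetrics` REST object (for example `appsSubmitted` and `reservedMB`). The sketch below illustrates that naming correspondence and derives a memory utilization figure from two of the gauges. The correspondence is inferred from the names rather than stated by this page, the helper names are hypothetical, and the payload values are made up.

```python
import re


def to_snake_case(name):
    """Convert camelCase to snake_case, e.g. 'reservedMB' -> 'reserved_mb'."""
    # Insert "_" wherever a lowercase letter or digit is followed by uppercase.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


def cluster_gauges(payload, prefix="yarn.metrics"):
    """Map numeric clusterMetrics fields to '<prefix>.<snake_case>' gauges."""
    gauges = {}
    for field, value in payload.get("clusterMetrics", {}).items():
        if isinstance(value, (int, float)):
            gauges["%s.%s" % (prefix, to_snake_case(field))] = value
    return gauges


def memory_utilization_pct(gauges):
    """Allocated memory as a percentage of total, or None if total is zero."""
    total = gauges.get("yarn.metrics.total_mb", 0)
    if not total:
        return None
    return gauges.get("yarn.metrics.allocated_mb", 0) / total * 100


# Made-up sample in the shape returned by /ws/v1/cluster/metrics.
SAMPLE_PAYLOAD = {
    "clusterMetrics": {
        "appsSubmitted": 4,
        "appsRunning": 1,
        "allocatedMB": 2048,
        "totalMB": 8192,
        "availableVirtualCores": 6,
        "unhealthyNodes": 0,
    }
}

if __name__ == "__main__":
    sample_gauges = cluster_gauges(SAMPLE_PAYLOAD)
    print(sample_gauges)
    print(memory_utilization_pct(sample_gauges))
```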

Events

The YARN check does not include any events at this time.

Service Checks

yarn.can_connect:

Returns CRITICAL if the Agent cannot connect to the ResourceManager URI to collect metrics; otherwise returns OK.
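
To reproduce this check by hand when debugging, something like the following can be run from the Agent host. It is a rough sketch of the OK/CRITICAL logic described above, not the check's actual implementation: the probed path, the 5-second timeout, and the injectable `opener` parameter are all assumptions of the sketch.

```python
from urllib.error import URLError
from urllib.request import urlopen

OK, CRITICAL = "OK", "CRITICAL"


def can_connect(resourcemanager_uri, opener=urlopen, timeout=5):
    """Return OK if the ResourceManager answers an HTTP request, else CRITICAL."""
    # Assumption: probe the cluster metrics resource of the REST API.
    url = resourcemanager_uri.rstrip("/") + "/ws/v1/cluster/metrics"
    try:
        with opener(url, timeout=timeout):
            return OK
    except (URLError, OSError, ValueError):
        # URLError/OSError: refused, timed out, DNS failure, HTTP error status.
        # ValueError: malformed URL (for example a missing scheme).
        return CRITICAL


if __name__ == "__main__":
    print(can_connect("http://localhost:8088"))
```

The `opener` argument exists only so the function can be exercised without a live ResourceManager, for example by passing a stub that raises `URLError`.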

Troubleshooting

Need help? Contact Datadog Support.

Further Reading