Spark

Agent Check

Supported OS: Linux, Mac OS, Windows

[Image: Spark dashboard graph]

Overview

The Spark check collects metrics for:

  • Drivers and executors: RDD blocks, memory used, disk used, duration, etc.
  • RDDs: partition count, memory used, disk used
  • Tasks: number of tasks active, skipped, failed, total
  • Job state: number of jobs active, completed, skipped, failed

Setup

Installation

The Spark check is included in the Datadog Agent package, so you don’t need to install anything else on your:

  • Mesos master (if you’re running Spark on Mesos),
  • YARN ResourceManager (if you’re running Spark on YARN), or
  • Spark master (if you’re running Standalone Spark)

Configuration

  1. Edit the spark.d/conf.yaml file, in the conf.d/ folder at the root of your Agent’s configuration directory. See the sample spark.d/conf.yaml for all available configuration options:

        init_config:
    
        instances:
          - spark_url: http://localhost:8088 # Spark master web UI
        #   spark_url: http://<Mesos_master>:5050 # Mesos master web UI
        #   spark_url: http://<YARN_ResourceManager_address>:8088 # YARN ResourceManager address
    
            spark_cluster_mode: spark_standalone_mode # default is spark_yarn_mode
        #   spark_cluster_mode: spark_mesos_mode
        #   spark_cluster_mode: spark_yarn_mode
    
            cluster_name: <CLUSTER_NAME> # required; adds a tag 'cluster_name:<CLUSTER_NAME>' to all metrics
    
        #   spark_pre_20_mode: true   # if you use Standalone Spark < v2.0
        #   spark_proxy_enabled: true # if you have enabled the spark UI proxy

    Set spark_url and spark_cluster_mode according to how you’re running Spark; a worked YARN example follows this list.

  2. Restart the Agent to start sending Spark metrics to Datadog.
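
For example, a minimal configuration for Spark on YARN (the default mode) might look like the following. The cluster name here is illustrative; substitute your ResourceManager address and your own cluster name:

    init_config:

    instances:
      - spark_url: http://<YARN_ResourceManager_address>:8088
        spark_cluster_mode: spark_yarn_mode
        cluster_name: MyYarnCluster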

Validation

Run the Agent’s status subcommand and look for spark under the Checks section.
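
With Agent v6 on Linux, for example, that is:

    sudo datadog-agent status

On Agent v5, run sudo /etc/init.d/datadog-agent info instead.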

Data Collected

Metrics

spark.job.count (rate): Number of jobs. Shown as task.
spark.job.num_tasks (rate): Number of tasks in the application. Shown as task.
spark.job.num_active_tasks (rate): Number of active tasks in the application. Shown as task.
spark.job.num_skipped_tasks (rate): Number of skipped tasks in the application. Shown as task.
spark.job.num_failed_tasks (rate): Number of failed tasks in the application. Shown as task.
spark.job.num_completed_tasks (rate): Number of completed tasks in the application. Shown as task.
spark.job.num_active_stages (rate): Number of active stages in the application. Shown as stage.
spark.job.num_completed_stages (rate): Number of completed stages in the application. Shown as stage.
spark.job.num_skipped_stages (rate): Number of skipped stages in the application. Shown as stage.
spark.job.num_failed_stages (rate): Number of failed stages in the application. Shown as stage.
spark.stage.count (rate): Number of stages. Shown as task.
spark.stage.num_active_tasks (rate): Number of active tasks in the application's stages. Shown as task.
spark.stage.num_complete_tasks (rate): Number of complete tasks in the application's stages. Shown as task.
spark.stage.num_failed_tasks (rate): Number of failed tasks in the application's stages. Shown as task.
spark.stage.executor_run_time (gauge): Fraction of time (ms/s) spent by the executor in the application's stages. Shown as fraction.
spark.stage.input_bytes (rate): Input bytes in the application's stages. Shown as byte.
spark.stage.input_records (rate): Input records in the application's stages. Shown as record.
spark.stage.output_bytes (rate): Output bytes in the application's stages. Shown as byte.
spark.stage.output_records (rate): Output records in the application's stages. Shown as record.
spark.stage.shuffle_read_bytes (rate): Number of bytes read during a shuffle in the application's stages. Shown as byte.
spark.stage.shuffle_read_records (rate): Number of records read during a shuffle in the application's stages. Shown as record.
spark.stage.shuffle_write_bytes (rate): Number of shuffled bytes in the application's stages. Shown as byte.
spark.stage.shuffle_write_records (rate): Number of shuffled records in the application's stages. Shown as record.
spark.stage.memory_bytes_spilled (rate): Number of bytes spilled to disk in the application's stages. Shown as byte.
spark.stage.disk_bytes_spilled (rate): Max size on disk of the spilled bytes in the application's stages. Shown as byte.
spark.driver.rdd_blocks (rate): Number of RDD blocks in the driver. Shown as block.
spark.driver.memory_used (rate): Amount of memory used in the driver. Shown as byte.
spark.driver.disk_used (rate): Amount of disk used in the driver. Shown as byte.
spark.driver.active_tasks (rate): Number of active tasks in the driver. Shown as task.
spark.driver.failed_tasks (rate): Number of failed tasks in the driver. Shown as task.
spark.driver.completed_tasks (rate): Number of completed tasks in the driver. Shown as task.
spark.driver.total_tasks (rate): Number of total tasks in the driver. Shown as task.
spark.driver.total_duration (gauge): Fraction of time (ms/s) spent by the driver. Shown as fraction.
spark.driver.total_input_bytes (rate): Number of input bytes in the driver. Shown as byte.
spark.driver.total_shuffle_read (rate): Number of bytes read during a shuffle in the driver. Shown as byte.
spark.driver.total_shuffle_write (rate): Number of shuffled bytes in the driver. Shown as byte.
spark.driver.max_memory (rate): Maximum memory used in the driver. Shown as byte.
spark.executor.count (rate): Number of executors. Shown as task.
spark.executor.rdd_blocks (rate): Number of persisted RDD blocks in the application's executors. Shown as block.
spark.executor.memory_used (rate): Amount of memory used for cached RDDs in the application's executors. Shown as byte.
spark.executor.max_memory (rate): Max memory across all executors working for a particular application. Shown as byte.
spark.executor.disk_used (rate): Amount of disk space used by persisted RDDs in the application's executors. Shown as byte.
spark.executor.active_tasks (rate): Number of active tasks in the application's executors. Shown as task.
spark.executor.failed_tasks (rate): Number of failed tasks in the application's executors. Shown as task.
spark.executor.completed_tasks (rate): Number of completed tasks in the application's executors. Shown as task.
spark.executor.total_tasks (rate): Total number of tasks in the application's executors. Shown as task.
spark.executor.total_duration (gauge): Fraction of time (ms/s) spent by the application's executors executing tasks. Shown as fraction.
spark.executor.total_input_bytes (rate): Total number of input bytes in the application's executors. Shown as byte.
spark.executor.total_shuffle_read (rate): Total number of bytes read during a shuffle in the application's executors. Shown as byte.
spark.executor.total_shuffle_write (rate): Total number of shuffled bytes in the application's executors. Shown as byte.
spark.executor_memory (rate): Maximum memory available for caching RDD blocks in the application's executors. Shown as byte.
spark.rdd.count (rate): Number of RDDs.
spark.rdd.num_partitions (rate): Number of persisted RDD partitions in the application.
spark.rdd.num_cached_partitions (rate): Number of in-memory cached RDD partitions in the application.
spark.rdd.memory_used (rate): Amount of memory used in the application's persisted RDDs. Shown as byte.
spark.rdd.disk_used (rate): Amount of disk space used by persisted RDDs in the application. Shown as byte.

Events

The Spark check does not include any events at this time.

Service Checks

The Agent submits one of the following service checks, depending on how you’re running Spark:

  • spark.standalone_master.can_connect
  • spark.mesos_master.can_connect
  • spark.application_master.can_connect
  • spark.resource_manager.can_connect

Each check returns CRITICAL if the Agent cannot collect Spark metrics; otherwise, it returns OK.

Troubleshooting

Spark on AWS EMR

To collect Spark metrics when Spark is set up on AWS EMR, use bootstrap actions to install the Datadog Agent and then create the /etc/dd-agent/conf.d/spark.yaml configuration file with the proper values on each EMR node; a sketch of such a bootstrap action follows.
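
Here is a minimal sketch of one such bootstrap script. It assumes the Agent v5 one-line installer, a YARN-mode cluster, and placeholder values for the API key and cluster name; adapt it to your setup:

    #!/bin/bash
    # Install the Datadog Agent (Agent v5 one-line installer; substitute your API key)
    DD_API_KEY=<YOUR_API_KEY> bash -c "$(curl -L https://raw.githubusercontent.com/DataDog/dd-agent/master/packaging/datadog-agent/source/install_agent.sh)"

    # Write the Spark check configuration (cluster name is a placeholder)
    cat <<'EOF' | sudo tee /etc/dd-agent/conf.d/spark.yaml
    init_config:

    instances:
      - spark_url: http://localhost:8088
        spark_cluster_mode: spark_yarn_mode
        cluster_name: MyEMRCluster
    EOF

    # Restart the Agent so it picks up the new check
    sudo /etc/init.d/datadog-agent restart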
