Datadog-Spark Integration


Overview

The Spark check collects metrics for:

  • Drivers and executors: RDD blocks, memory used, disk used, duration, etc.
  • RDDs: partition count, memory used, disk used
  • Tasks: number of tasks active, skipped, failed, total
  • Job state: number of jobs active, completed, skipped, failed

Setup

Installation

The Spark check is packaged with the Agent, so install the Agent on your:

  • Mesos master (if you’re running Spark on Mesos),
  • YARN ResourceManager (if you’re running Spark on YARN), or
  • Spark master (if you’re running Standalone Spark)

If you need the newest version of the check, install the dd-check-spark package.

Configuration

Create a file spark.yaml in the Agent’s conf.d directory. See the sample spark.yaml for all available configuration options:

init_config:

instances:
  - spark_url: http://localhost:8088 # Spark master web UI 
#   spark_url: http://<Mesos_master>:5050 # Mesos master web UI
#   spark_url: http://<YARN_ResourceManager_address>:8088 # YARN ResourceManager address

    spark_cluster_mode: spark_standalone_mode # default is spark_yarn_mode
#   spark_cluster_mode: spark_mesos_mode
#   spark_cluster_mode: spark_yarn_mode

    cluster_name: <CLUSTER_NAME> # required; adds a tag 'cluster_name:<CLUSTER_NAME>' to all metrics

#   spark_pre_20_mode: true   # if you use Standalone Spark < v2.0
#   spark_proxy_enabled: true # if you have enabled the spark UI proxy

Set spark_url and spark_cluster_mode according to how you’re running Spark.
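Internally, the check combines spark_url and spark_cluster_mode to decide which REST endpoint to poll for the list of running applications. The following is a minimal sketch of that combination; the paths follow the Spark, Mesos, and YARN monitoring REST APIs, but the exact endpoints the check queries (and the helper function name) are assumptions for illustration:

```python
# Hedged sketch: map spark_cluster_mode to the REST path polled at spark_url.
# The paths below are assumptions based on the Spark/Mesos/YARN REST APIs.
def first_endpoint(spark_url, cluster_mode):
    """Return the application-list URL the check would poll first."""
    paths = {
        "spark_standalone_mode": "/api/v1/applications",  # Spark REST API
        "spark_mesos_mode": "/frameworks",                # Mesos master state
        "spark_yarn_mode": "/ws/v1/cluster/apps",         # YARN ResourceManager
    }
    return spark_url.rstrip("/") + paths[cluster_mode]

print(first_endpoint("http://localhost:8088", "spark_yarn_mode"))
# http://localhost:8088/ws/v1/cluster/apps
```

If a request to the resulting URL fails from the Agent host (for example, with curl), the check will fail the same way, so this is a quick sanity check before restarting the Agent.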

Restart the Agent (e.g. sudo /etc/init.d/datadog-agent restart on most Linux systems) to start sending Spark metrics to Datadog.

Validation

Run the Agent’s info subcommand and look for spark under the Checks section:

  Checks
  ======
    [...]

    spark
    -------
      - instance #0 [OK]
      - Collected 26 metrics, 0 events & 1 service check

    [...]

Compatibility

The Spark check is compatible with all major platforms.

Data Collected

Metrics

  • spark.job.num_tasks (rate): Number of tasks in the application; shown as task
  • spark.job.num_active_tasks (rate): Number of active tasks in the application; shown as task
  • spark.job.num_skipped_tasks (rate): Number of skipped tasks in the application; shown as task
  • spark.job.num_failed_tasks (rate): Number of failed tasks in the application; shown as task
  • spark.job.num_active_stages (rate): Number of active stages in the application; shown as stage
  • spark.job.num_completed_stages (rate): Number of completed stages in the application; shown as stage
  • spark.job.num_skipped_stages (rate): Number of skipped stages in the application; shown as stage
  • spark.job.num_failed_stages (rate): Number of failed stages in the application; shown as stage
  • spark.stage.num_active_tasks (rate): Number of active tasks in the application's stages; shown as task
  • spark.stage.num_complete_tasks (rate): Number of complete tasks in the application's stages; shown as task
  • spark.stage.num_failed_tasks (rate): Number of failed tasks in the application's stages; shown as task
  • spark.stage.executor_run_time (gauge): Fraction of time (ms/s) spent by the executor in the application's stages; shown as fraction
  • spark.stage.input_bytes (rate): Input bytes in the application's stages; shown as byte
  • spark.stage.input_records (rate): Input records in the application's stages; shown as record
  • spark.stage.output_bytes (rate): Output bytes in the application's stages; shown as byte
  • spark.stage.output_records (rate): Output records in the application's stages; shown as record
  • spark.stage.shuffle_read_bytes (rate): Number of bytes read during a shuffle in the application's stages; shown as byte
  • spark.stage.shuffle_read_records (rate): Number of records read during a shuffle in the application's stages; shown as record
  • spark.stage.shuffle_write_bytes (rate): Number of shuffled bytes in the application's stages; shown as byte
  • spark.stage.shuffle_write_records (rate): Number of shuffled records in the application's stages; shown as record
  • spark.stage.memory_bytes_spilled (rate): Number of bytes spilled to disk in the application's stages; shown as byte
  • spark.stage.disk_bytes_spilled (rate): Max size on disk of the spilled bytes in the application's stages; shown as byte
  • spark.driver.rdd_blocks (rate): Number of RDD blocks in the driver; shown as block
  • spark.driver.memory_used (rate): Amount of memory used in the driver; shown as byte
  • spark.driver.disk_used (rate): Amount of disk used in the driver; shown as byte
  • spark.driver.active_tasks (rate): Number of active tasks in the driver; shown as task
  • spark.driver.failed_tasks (rate): Number of failed tasks in the driver; shown as task
  • spark.driver.completed_tasks (rate): Number of completed tasks in the driver; shown as task
  • spark.driver.total_tasks (rate): Number of total tasks in the driver; shown as task
  • spark.driver.total_duration (gauge): Fraction of time (ms/s) spent by the driver; shown as fraction
  • spark.driver.total_input_bytes (rate): Number of input bytes in the driver; shown as byte
  • spark.driver.total_shuffle_read (rate): Number of bytes read during a shuffle in the driver; shown as byte
  • spark.driver.total_shuffle_write (rate): Number of shuffled bytes in the driver; shown as byte
  • spark.driver.max_memory (rate): Maximum memory used in the driver; shown as byte
  • spark.executor.rdd_blocks (rate): Number of persisted RDD blocks in the application's executors; shown as block
  • spark.executor.memory_used (rate): Amount of memory used for cached RDDs in the application's executors; shown as byte
  • spark.executor.disk_used (rate): Amount of disk space used by persisted RDDs in the application's executors; shown as byte
  • spark.executor.active_tasks (rate): Number of active tasks in the application's executors; shown as task
  • spark.executor.failed_tasks (rate): Number of failed tasks in the application's executors; shown as task
  • spark.executor.completed_tasks (rate): Number of completed tasks in the application's executors; shown as task
  • spark.executor.total_tasks (rate): Total number of tasks in the application's executors; shown as task
  • spark.executor.total_duration (gauge): Fraction of time (ms/s) spent by the application's executors executing tasks; shown as fraction
  • spark.executor.total_input_bytes (rate): Total number of input bytes in the application's executors; shown as byte
  • spark.executor.total_shuffle_read (rate): Total number of bytes read during a shuffle in the application's executors; shown as byte
  • spark.executor.total_shuffle_write (rate): Total number of shuffled bytes in the application's executors; shown as byte
  • spark.executor_memory (rate): Maximum memory available for caching RDD blocks in the application's executors; shown as byte
  • spark.rdd.num_partitions (rate): Number of persisted RDD partitions in the application
  • spark.rdd.num_cached_partitions (rate): Number of in-memory cached RDD partitions in the application
  • spark.rdd.memory_used (rate): Amount of memory used in the application's persisted RDDs; shown as byte
  • spark.rdd.disk_used (rate): Amount of disk space used by persisted RDDs in the application; shown as byte

Events

The Spark check does not include any events at this time.

Service Checks

The Agent submits one of the following service checks, depending on how you’re running Spark:

  • spark.standalone_master.can_connect
  • spark.mesos_master.can_connect
  • spark.resource_manager.can_connect

Each check returns CRITICAL if the Agent cannot collect Spark metrics, otherwise OK.
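The can_connect logic amounts to issuing an HTTP request against the configured URL and reporting OK on a response, CRITICAL on a connection failure. A rough, self-contained sketch of that behavior (the function name and the throwaway local server are illustrative, not part of the check):

```python
# Hedged sketch of a can_connect-style service check: GET the configured URL,
# report "OK" on any response and "CRITICAL" on a connection failure.
import http.server
import threading
import urllib.error
import urllib.request

def can_connect(url, timeout=5):
    """Return 'OK' if url answers an HTTP GET, 'CRITICAL' otherwise."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return "OK"
    except (urllib.error.URLError, OSError):
        return "CRITICAL"

# Demo against a throwaway local server standing in for the Spark master UI.
server = http.server.HTTPServer(("127.0.0.1", 0), http.server.SimpleHTTPRequestHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]
print(can_connect(f"http://127.0.0.1:{port}"))  # OK
server.shutdown()
print(can_connect("http://127.0.0.1:1/"))       # CRITICAL (nothing listening)
```

Running the same kind of request by hand from the Agent host is a quick way to tell whether a CRITICAL status reflects a network problem or a misconfigured spark_url.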

Troubleshooting

Need help? Contact Datadog Support.
