Spark

Agent Check

Supported OS: Linux, macOS, Windows

Spark graph

Overview

This check monitors Spark through the Datadog Agent. Collect Spark metrics about:

  • Drivers and executors: RDD blocks, memory used, disk used, duration, etc.
  • RDDs: number of partitions, memory used, disk used.
  • Tasks: number of active, skipped, failed, and total tasks.
  • Job states: number of active, completed, skipped, and failed jobs.

Note: Spark Structured Streaming metrics are not supported.

Setup

Installation

The Spark check is included in the Datadog Agent package, so you don't need to install anything else on your Mesos master (for Spark on Mesos), YARN ResourceManager (for Spark on YARN), or Spark master (for Spark Standalone).

Configuration

Host

To configure this check for an Agent running on a host, follow the instructions below. For containerized environments, see the Containerized environment section.

  1. Edit the spark.d/conf.yaml file in the conf.d/ folder at the root of your Agent's configuration directory. The following parameters may require updating; see the sample spark.d/conf.yaml for all available configuration options. A minimal sketch for Spark on Mesos follows this list.

    init_config:
    
    instances:
     - spark_url: http://localhost:8080 # Spark master web UI
       #   spark_url: http://<Mesos_master>:5050 # Mesos master web UI
       #   spark_url: http://<YARN_ResourceManager_address>:8088 # YARN ResourceManager address
    
       spark_cluster_mode: spark_standalone_mode # default
       #   spark_cluster_mode: spark_mesos_mode
       #   spark_cluster_mode: spark_yarn_mode
       #   spark_cluster_mode: spark_driver_mode
    
       # required; adds a tag 'cluster_name:<CLUSTER_NAME>' to all metrics
       cluster_name: "<CLUSTER_NAME>"
       # spark_pre_20_mode: true   # if you use Standalone Spark < v2.0
       # spark_proxy_enabled: true # if you have enabled the spark UI proxy
  2. Restart the Agent.
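
For comparison, here is a minimal sketch of the same file for Spark running on Mesos. This is an illustration rather than an official template: the Mesos master address and cluster name are placeholders to adapt to your deployment.

    init_config:

    instances:
      - spark_url: http://<MESOS_MASTER_ADDRESS>:5050 # Mesos master web UI (placeholder address)
        spark_cluster_mode: spark_mesos_mode
        cluster_name: "<CLUSTER_NAME>" # required; tags all metrics with cluster_name:<CLUSTER_NAME>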

Containerized environment

For containerized environments, see the Autodiscovery Integration Templates documentation for guidance on applying the parameters below.

Parameter            Value
<INTEGRATION_NAME>   spark
<INIT_CONFIG>        blank or {}
<INSTANCE_CONFIG>    {"spark_url": "%%host%%:8080", "cluster_name":"<CLUSTER_NAME>"}
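
For example, on Kubernetes these parameters could be applied as Autodiscovery pod annotations. The following is a minimal sketch under assumed names: the pod name, container name, and image are hypothetical, and the Spark UI port may differ in your deployment.

    apiVersion: v1
    kind: Pod
    metadata:
      name: spark
      annotations:
        ad.datadoghq.com/spark.check_names: '["spark"]'
        ad.datadoghq.com/spark.init_configs: '[{}]'
        ad.datadoghq.com/spark.instances: '[{"spark_url": "%%host%%:8080", "cluster_name": "<CLUSTER_NAME>"}]'
    spec:
      containers:
        - name: spark # must match the annotation prefix ad.datadoghq.com/spark.*
          image: <SPARK_IMAGE> # placeholder image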

Validation

Run the Agent's status subcommand and look for spark under the Checks section.

Data Collected

Metrics

spark.job.count (count): Number of jobs. Shown as task.
spark.job.num_tasks (count): Number of tasks in the application. Shown as task.
spark.job.num_active_tasks (count): Number of active tasks in the application. Shown as task.
spark.job.num_skipped_tasks (count): Number of skipped tasks in the application. Shown as task.
spark.job.num_failed_tasks (count): Number of failed tasks in the application. Shown as task.
spark.job.num_completed_tasks (count): Number of completed tasks in the application. Shown as task.
spark.job.num_active_stages (count): Number of active stages in the application. Shown as stage.
spark.job.num_completed_stages (count): Number of completed stages in the application. Shown as stage.
spark.job.num_skipped_stages (count): Number of skipped stages in the application. Shown as stage.
spark.job.num_failed_stages (count): Number of failed stages in the application. Shown as stage.
spark.stage.count (count): Number of stages. Shown as task.
spark.stage.num_active_tasks (count): Number of active tasks in the application's stages. Shown as task.
spark.stage.num_complete_tasks (count): Number of complete tasks in the application's stages. Shown as task.
spark.stage.num_failed_tasks (count): Number of failed tasks in the application's stages. Shown as task.
spark.stage.executor_run_time (count): Time spent by the executor in the application's stages. Shown as millisecond.
spark.stage.input_bytes (count): Input bytes in the application's stages. Shown as byte.
spark.stage.input_records (count): Input records in the application's stages. Shown as record.
spark.stage.output_bytes (count): Output bytes in the application's stages. Shown as byte.
spark.stage.output_records (count): Output records in the application's stages. Shown as record.
spark.stage.shuffle_read_bytes (count): Number of bytes read during a shuffle in the application's stages. Shown as byte.
spark.stage.shuffle_read_records (count): Number of records read during a shuffle in the application's stages. Shown as record.
spark.stage.shuffle_write_bytes (count): Number of shuffled bytes in the application's stages. Shown as byte.
spark.stage.shuffle_write_records (count): Number of shuffled records in the application's stages. Shown as record.
spark.stage.memory_bytes_spilled (count): Number of bytes spilled to disk in the application's stages. Shown as byte.
spark.stage.disk_bytes_spilled (count): Max size on disk of the spilled bytes in the application's stages. Shown as byte.
spark.driver.rdd_blocks (count): Number of RDD blocks in the driver. Shown as block.
spark.driver.memory_used (count): Amount of memory used in the driver. Shown as byte.
spark.driver.disk_used (count): Amount of disk used in the driver. Shown as byte.
spark.driver.active_tasks (count): Number of active tasks in the driver. Shown as task.
spark.driver.failed_tasks (count): Number of failed tasks in the driver. Shown as task.
spark.driver.completed_tasks (count): Number of completed tasks in the driver. Shown as task.
spark.driver.total_tasks (count): Number of total tasks in the driver. Shown as task.
spark.driver.total_duration (count): Time spent in the driver. Shown as millisecond.
spark.driver.total_input_bytes (count): Number of input bytes in the driver. Shown as byte.
spark.driver.total_shuffle_read (count): Number of bytes read during a shuffle in the driver. Shown as byte.
spark.driver.total_shuffle_write (count): Number of shuffled bytes in the driver. Shown as byte.
spark.driver.max_memory (count): Maximum memory used in the driver. Shown as byte.
spark.executor.count (count): Number of executors. Shown as task.
spark.executor.rdd_blocks (count): Number of persisted RDD blocks in the application's executors. Shown as block.
spark.executor.memory_used (count): Amount of memory used for cached RDDs in the application's executors. Shown as byte.
spark.executor.max_memory (count): Max memory across all executors working for a particular application. Shown as byte.
spark.executor.disk_used (count): Amount of disk space used by persisted RDDs in the application's executors. Shown as byte.
spark.executor.active_tasks (count): Number of active tasks in the application's executors. Shown as task.
spark.executor.failed_tasks (count): Number of failed tasks in the application's executors. Shown as task.
spark.executor.completed_tasks (count): Number of completed tasks in the application's executors. Shown as task.
spark.executor.total_tasks (count): Total number of tasks in the application's executors. Shown as task.
spark.executor.total_duration (count): Time spent by the application's executors executing tasks. Shown as millisecond.
spark.executor.total_input_bytes (count): Total number of input bytes in the application's executors. Shown as byte.
spark.executor.total_shuffle_read (count): Total number of bytes read during a shuffle in the application's executors. Shown as byte.
spark.executor.total_shuffle_write (count): Total number of shuffled bytes in the application's executors. Shown as byte.
spark.executor_memory (count): Maximum memory available for caching RDD blocks in the application's executors. Shown as byte.
spark.rdd.count (count): Number of RDDs.
spark.rdd.num_partitions (count): Number of persisted RDD partitions in the application.
spark.rdd.num_cached_partitions (count): Number of in-memory cached RDD partitions in the application.
spark.rdd.memory_used (count): Amount of memory used in the application's persisted RDDs. Shown as byte.
spark.rdd.disk_used (count): Amount of disk space used by persisted RDDs in the application. Shown as byte.
spark.streaming.statistics.avg_input_rate (gauge): Average streaming input data rate. Shown as byte.
spark.streaming.statistics.avg_processing_time (gauge): Average application's streaming batch processing time. Shown as millisecond.
spark.streaming.statistics.avg_scheduling_delay (gauge): Average application's streaming batch scheduling delay. Shown as millisecond.
spark.streaming.statistics.avg_total_delay (gauge): Average application's streaming batch total delay. Shown as millisecond.
spark.streaming.statistics.batch_duration (gauge): Application's streaming batch duration. Shown as millisecond.
spark.streaming.statistics.num_active_batches (gauge): Number of active streaming batches. Shown as job.
spark.streaming.statistics.num_active_receivers (gauge): Number of active streaming receivers. Shown as object.
spark.streaming.statistics.num_inactive_receivers (gauge): Number of inactive streaming receivers. Shown as object.
spark.streaming.statistics.num_processed_records (count): Number of processed streaming records. Shown as record.
spark.streaming.statistics.num_received_records (count): Number of received streaming records. Shown as record.
spark.streaming.statistics.num_receivers (gauge): Number of streaming application's receivers. Shown as object.
spark.streaming.statistics.num_retained_completed_batches (count): Number of retained completed application's streaming batches. Shown as job.
spark.streaming.statistics.num_total_completed_batches (count): Total number of completed application's streaming batches. Shown as job.

Events

The Spark check does not include any events.

Service Checks

The Agent submits one of the following service checks, depending on how you run Spark:

spark.standalone_master.can_connect
Returns CRITICAL if the Agent is unable to connect to the Spark instance's standalone master. Returns OK otherwise.

spark.mesos_master.can_connect
Returns CRITICAL if the Agent is unable to connect to the Spark instance's Mesos master. Returns OK otherwise.

spark.application_master.can_connect
Returns CRITICAL if the Agent is unable to connect to the Spark instance's ApplicationMaster. Returns OK otherwise.

spark.resource_manager.can_connect
Returns CRITICAL if the Agent is unable to connect to the Spark instance's ResourceManager. Returns OK otherwise.

spark.driver.can_connect
Returns CRITICAL if the Agent is unable to connect to the Spark instance's driver. Returns OK otherwise.

Troubleshooting

Spark on AWS EMR

To collect Spark metrics when Spark is set up on AWS EMR, use bootstrap actions to install the Datadog Agent, then create the /etc/dd-agent/conf.d/spark.yaml configuration file with the appropriate values on each EMR node. A sketch of such a file is shown below.
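
EMR runs Spark on YARN, so a plausible minimal sketch of that file is the following; the ResourceManager address and cluster name are placeholders, and this is an illustration rather than an official template.

    init_config:

    instances:
      - spark_url: http://<RESOURCE_MANAGER_ADDRESS>:8088 # YARN ResourceManager (8088 is the YARN default port)
        spark_cluster_mode: spark_yarn_mode
        cluster_name: "<EMR_CLUSTER_NAME>" # placeholder; tags all metrics from this cluster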

Further Reading

Additional helpful documentation, links, and articles: