Google Cloud Dataproc

Overview

Data Jobs Monitoring helps you observe, troubleshoot, and cost-optimize your Spark jobs on your Dataproc clusters.

Google Cloud Dataproc is a fast, easy-to-use, fully managed cloud service for running Apache Spark and Apache Hadoop clusters in a simpler, more cost-efficient way.

Use the Datadog Google Cloud Platform integration to collect metrics from Google Cloud Dataproc.

Setup

Installation

If you haven't already, set up the Google Cloud Platform integration first. There are no other installation steps.

Log collection

Google Cloud Dataproc logs are collected with Google Cloud Logging and sent to a Dataflow job through a Cloud Pub/Sub topic. If you haven't already, set up logging with the Datadog Dataflow template.

Once this is done, export your Google Cloud Dataproc logs from Google Cloud Logging to the Pub/Sub topic:

  1. Go to the Google Cloud Logging page and filter for Google Cloud Dataproc logs.
  2. Click Create Export and name the sink.
  3. Choose "Cloud Pub/Sub" as the destination and select the Pub/Sub topic that was created for that purpose. Note: The Pub/Sub topic can be located in a different project.
  4. Click Create and wait for the confirmation message to appear.
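
The console steps above can also be scripted. The following is a minimal sketch, not an official procedure: it only builds the argument list for a `gcloud logging sinks create` call and prints it, without running anything. The sink name, project ID, and topic name are hypothetical placeholders, and the filter assumes that Dataproc cluster logs carry the `cloud_dataproc_cluster` resource type; adjust it to match the filter you would use in step 1.

```python
import shlex

# Assumption: Dataproc cluster logs are identified by this monitored
# resource type. Change the filter if your logs are labeled differently.
DATAPROC_FILTER = 'resource.type="cloud_dataproc_cluster"'


def build_sink_command(sink_name, topic_project, topic_name,
                       log_filter=DATAPROC_FILTER):
    """Return the argument list for `gcloud logging sinks create`.

    The topic may live in a different project than the logs, so the
    topic's project is passed explicitly.
    """
    destination = (
        f"pubsub.googleapis.com/projects/{topic_project}/topics/{topic_name}"
    )
    return [
        "gcloud", "logging", "sinks", "create",
        sink_name,
        destination,
        f"--log-filter={log_filter}",
    ]


# Hypothetical names, for illustration only.
command = build_sink_command(
    "dataproc-to-datadog", "my-logging-project", "export-logs-to-datadog"
)
print(shlex.join(command))
```

The command is returned as a list so that it can be passed to `subprocess.run` without shell quoting problems. Depending on your project setup, the sink's writer identity may also need permission to publish to the topic before logs start flowing.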

Data Collected

Metrics

gcp.dataproc.batch.spark.executors
(gauge)
Indicates the number of Batch Spark executors.
Shown as worker
gcp.dataproc.cluster.capacity_deviation
(gauge)
Difference between the expected node count in the cluster and the actual active YARN node managers.
gcp.dataproc.cluster.hdfs.datanodes
(gauge)
Indicates the number of HDFS DataNodes that are running inside a cluster.
Shown as node
gcp.dataproc.cluster.hdfs.storage_capacity
(gauge)
Indicates the capacity of the HDFS system running on a cluster, in GB.
Shown as gibibyte
gcp.dataproc.cluster.hdfs.storage_utilization
(gauge)
The percentage of HDFS storage currently used.
Shown as percent
gcp.dataproc.cluster.hdfs.unhealthy_blocks
(gauge)
Indicates the number of unhealthy blocks inside the cluster.
Shown as block
gcp.dataproc.cluster.job.completion_time.avg
(gauge)
The time jobs took to complete from the time the user submits a job to the time Dataproc reports it is completed.
Shown as millisecond
gcp.dataproc.cluster.job.completion_time.samplecount
(count)
Sample count for cluster job completion time.
Shown as millisecond
gcp.dataproc.cluster.job.completion_time.sumsqdev
(gauge)
Sum of squared deviation for cluster job completion time.
Shown as second
gcp.dataproc.cluster.job.duration.avg
(gauge)
The time jobs have spent in a given state.
Shown as millisecond
gcp.dataproc.cluster.job.duration.samplecount
(count)
Sample count for cluster job duration.
Shown as millisecond
gcp.dataproc.cluster.job.duration.sumsqdev
(gauge)
Sum of squared deviation for cluster job duration.
Shown as second
gcp.dataproc.cluster.job.failed_count
(count)
Indicates the number of jobs that have failed on a cluster.
Shown as job
gcp.dataproc.cluster.job.running_count
(gauge)
Indicates the number of jobs that are running on a cluster.
Shown as job
gcp.dataproc.cluster.job.submitted_count
(count)
Indicates the number of jobs that have been submitted to a cluster.
Shown as job
gcp.dataproc.cluster.mig_instances.failed_count
(count)
Indicates the number of instance failures for a managed instance group.
gcp.dataproc.cluster.nodes.expected
(gauge)
Indicates the number of nodes that are expected in a cluster.
Shown as node
gcp.dataproc.cluster.nodes.failed_count
(count)
Indicates the number of nodes that have failed in a cluster.
Shown as node
gcp.dataproc.cluster.nodes.recovered_count
(count)
Indicates the number of nodes that were detected as failed and have been successfully removed from the cluster.
Shown as node
gcp.dataproc.cluster.nodes.running
(gauge)
Indicates the number of nodes in running state.
Shown as node
gcp.dataproc.cluster.operation.completion_time.avg
(gauge)
The time operations took to complete from the time the user submits an operation to the time Dataproc reports it is completed.
Shown as millisecond
gcp.dataproc.cluster.operation.completion_time.samplecount
(count)
Sample count for cluster operation completion time.
Shown as millisecond
gcp.dataproc.cluster.operation.completion_time.sumsqdev
(gauge)
Sum of squared deviation for cluster operation completion time.
Shown as second
gcp.dataproc.cluster.operation.duration.avg
(gauge)
The time operations have spent in a given state.
Shown as millisecond
gcp.dataproc.cluster.operation.duration.samplecount
(count)
Sample count for cluster operation duration.
Shown as millisecond
gcp.dataproc.cluster.operation.duration.sumsqdev
(gauge)
Sum of squared deviation for cluster operation duration.
Shown as second
gcp.dataproc.cluster.operation.failed_count
(count)
Indicates the number of operations that have failed on a cluster.
Shown as operation
gcp.dataproc.cluster.operation.running_count
(gauge)
Indicates the number of operations that are running on a cluster.
Shown as operation
gcp.dataproc.cluster.operation.submitted_count
(count)
Indicates the number of operations that have been submitted to a cluster.
Shown as operation
gcp.dataproc.cluster.yarn.allocated_memory_percentage
(gauge)
The percentage of YARN memory that is allocated.
Shown as percent
gcp.dataproc.cluster.yarn.apps
(gauge)
Indicates the number of active YARN applications.
gcp.dataproc.cluster.yarn.containers
(gauge)
Indicates the number of YARN containers.
Shown as container
gcp.dataproc.cluster.yarn.memory_size
(gauge)
Indicates the YARN memory size in GB.
Shown as gibibyte
gcp.dataproc.cluster.yarn.nodemanagers
(gauge)
Indicates the number of YARN NodeManagers running inside the cluster.
gcp.dataproc.cluster.yarn.pending_memory_size
(gauge)
The current memory request, in GB, that is pending to be fulfilled by the scheduler.
Shown as gibibyte
gcp.dataproc.cluster.yarn.virtual_cores
(gauge)
Indicates the number of virtual cores in YARN.
Shown as core
gcp.dataproc.job.state
(gauge)
Indicates whether a job is currently in a particular state or not.
gcp.dataproc.job.yarn.memory_seconds
(gauge)
Indicates the Memory Seconds consumed by the job_id job per yarn application_id.
gcp.dataproc.job.yarn.vcore_seconds
(gauge)
Indicates the VCore Seconds consumed by the job_id job per yarn application_id.
gcp.dataproc.node.problem_count
(count)
Total number of times a specific type of problem has occurred.
gcp.dataproc.node.yarn.nodemanager.health
(gauge)
YARN NodeManager health state.
gcp.dataproc.session.spark.executors
(gauge)
Indicates the number of Session Spark executors.
Shown as worker
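
The completion time and duration metrics above are each reported as three series: `.avg`, `.samplecount`, and `.sumsqdev`. As a rough illustration of how they relate, the sketch below derives a population standard deviation from a sample count and a sum of squared deviations. The function name and the input values are hypothetical, and the sketch assumes both values come from the same time window and that the squared deviations are expressed in the same unit as the average you compare the result against.

```python
import math


def distribution_stddev(sample_count, sum_sq_dev):
    """Population standard deviation from a count and a sum of squared deviations.

    Returns None when there are no samples, because the spread is undefined.
    """
    if sample_count <= 0:
        return None
    variance = sum_sq_dev / sample_count
    return math.sqrt(variance)


# Made-up values: 4 jobs whose squared deviations from the mean sum to 36.
print(distribution_stddev(4, 36.0))
```

A standard deviation that is large relative to the `.avg` value suggests that job or operation times vary widely, so the average alone may be a poor summary.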

Events

The Google Cloud Dataproc integration does not include any events.

Service Checks

The Google Cloud Dataproc integration does not include any service checks.

Troubleshooting

Need help? Contact Datadog support.