Supported OS: Linux, macOS

Integration version: 4.0.0

Ceph dashboard

Overview

Enable the Ceph integration with Datadog to:

  • Track disk usage across storage pools
  • Receive service checks when issues arise
  • Monitor I/O performance metrics

Setup

Installation

The Ceph check is included in the Datadog Agent package, so you don't need to install anything else on your Ceph servers.

Configuration

Edit the ceph.d/conf.yaml file, located in the conf.d/ folder at the root of your Agent's configuration directory. See the sample ceph.d/conf.yaml for all available configuration options:

init_config:

instances:
  - ceph_cmd: /path/to/your/ceph # default is /usr/bin/ceph
    use_sudo: true # only needed if the ceph binary requires sudo on your nodes

If you enabled use_sudo, add a line like the following to your sudoers file:

dd-agent ALL=(ALL) NOPASSWD:/path/to/your/ceph

Log collection

Available for Agent versions 6.0 or later

  1. Log collection is disabled by default in the Datadog Agent; enable it in your datadog.yaml file:

    logs_enabled: true

  2. Then, edit ceph.d/conf.yaml and uncomment the logs lines at the bottom. Update the logs path with the correct path to your Ceph log files.

    logs:
      - type: file
        path: /var/log/ceph/*.log
        source: ceph
        service: "<APPLICATION_NAME>"
    
  3. Restart the Agent.

Validation

Run the Agent's status subcommand (datadog-agent status) and look for ceph under the Checks section.

Data Collected

Metrics

ceph.aggregate_pct_used
(gauge)
Overall capacity usage metric
Shown as percent
ceph.apply_latency_ms
(gauge)
Time taken to flush an update to disks
Shown as millisecond
ceph.commit_latency_ms
(gauge)
Time taken to commit an operation to the journal
Shown as millisecond
ceph.misplaced_objects
(gauge)
Number of objects misplaced
Shown as item
ceph.misplaced_total
(gauge)
Total number of objects if there are misplaced objects
Shown as item
ceph.num_full_osds
(gauge)
Number of full osds
Shown as item
ceph.num_in_osds
(gauge)
Number of participating storage daemons
Shown as item
ceph.num_mons
(gauge)
Number of monitor daemons
Shown as item
ceph.num_near_full_osds
(gauge)
Number of nearly full osds
Shown as item
ceph.num_objects
(gauge)
Object count for a given pool
Shown as item
ceph.num_osds
(gauge)
Number of known storage daemons
Shown as item
ceph.num_pgs
(gauge)
Number of placement groups available
Shown as item
ceph.num_pools
(gauge)
Number of pools
Shown as item
ceph.num_up_osds
(gauge)
Number of online storage daemons
Shown as item
ceph.op_per_sec
(gauge)
IO operations per second for a given pool
Shown as operation
ceph.osd.pct_used
(gauge)
Percentage used of full/near full osds
Shown as percent
ceph.pgstate.active_clean
(gauge)
Number of active+clean placement groups
Shown as item
ceph.read_bytes
(gauge)
Per-pool read bytes
Shown as byte
ceph.read_bytes_sec
(gauge)
Bytes/second being read
Shown as byte
ceph.read_op_per_sec
(gauge)
Per-pool read operations/second
Shown as operation
ceph.recovery_bytes_per_sec
(gauge)
Rate of recovered bytes
Shown as byte
ceph.recovery_keys_per_sec
(gauge)
Rate of recovered keys
Shown as item
ceph.recovery_objects_per_sec
(gauge)
Rate of recovered objects
Shown as item
ceph.total_objects
(gauge)
Object count from the underlying object store. [v<=3 only]
Shown as item
ceph.write_bytes
(gauge)
Per-pool write bytes
Shown as byte
ceph.write_bytes_sec
(gauge)
Bytes/second being written
Shown as byte
ceph.write_op_per_sec
(gauge)
Per-pool write operations/second
Shown as operation

Note: If you are running Ceph Luminous or later, the ceph.osd.pct_used metric is not included.
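For intuition, percentage gauges such as ceph.aggregate_pct_used express used capacity over total raw capacity. The snippet below is an illustrative sketch of that ratio with made-up numbers, not the check's actual implementation:

```python
def aggregate_pct_used(used_bytes: int, total_bytes: int) -> float:
    """Overall capacity usage as a percentage (illustrative only)."""
    if total_bytes <= 0:
        raise ValueError("total capacity must be positive")
    return 100.0 * used_bytes / total_bytes

# e.g. a cluster with 750 GiB used out of 1 TiB of raw capacity
print(round(aggregate_pct_used(750 * 2**30, 1024 * 2**30), 1))  # 73.2
```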

Events

The Ceph check does not include any events.

Service Checks

ceph.overall_status
Returns OK if your Ceph cluster status is HEALTH_OK, WARNING if it is HEALTH_WARN, CRITICAL otherwise.
Statuses: ok, warning, critical

ceph.osd_down
Returns OK if you have no down OSDs. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical

ceph.osd_orphan
Returns OK if you have no orphan OSDs. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical

ceph.osd_full
Returns OK if your OSDs are not full. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical

ceph.osd_nearfull
Returns OK if your OSDs are not near full. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical

ceph.pool_full
Returns OK if your pools have not reached their quota. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical

ceph.pool_near_full
Returns OK if your pools are not near reaching their quota. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical

ceph.pg_availability
Returns OK if there is full data availability. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical

ceph.pg_degraded
Returns OK if there is full data redundancy. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical

ceph.pg_degraded_full
Returns OK if there is enough space in the cluster for data redundancy. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical

ceph.pg_damaged
Returns OK if there are no inconsistencies after data scrubbing. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical

ceph.pg_not_scrubbed
Returns OK if the PGs were scrubbed recently. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical

ceph.pg_not_deep_scrubbed
Returns OK if the PGs were deep scrubbed recently. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical

ceph.cache_pool_near_full
Returns OK if the cache pools are not near full. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical

ceph.too_few_pgs
Returns OK if the number of PGs is above the min threshold. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical

ceph.too_many_pgs
Returns OK if the number of PGs is below the max threshold. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical

ceph.object_unfound
Returns OK if all objects can be found. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical

ceph.request_slow
Returns OK if requests are taking a normal time to process. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical

ceph.request_stuck
Returns OK if requests are taking a normal time to process. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical
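All of the checks above follow the same pattern: when a health condition is present, it maps to WARNING if its Ceph severity is HEALTH_WARN and to CRITICAL otherwise. A minimal sketch of that mapping (names are illustrative, not the check's actual code):

```python
OK, WARNING, CRITICAL = "ok", "warning", "critical"

def check_status(condition_present: bool, severity: str) -> str:
    """Map a Ceph health condition to a service check status.

    Illustrative only: mirrors the WARNING-on-HEALTH_WARN,
    CRITICAL-otherwise convention described above.
    """
    if not condition_present:
        return OK
    return WARNING if severity == "HEALTH_WARN" else CRITICAL

print(check_status(False, ""))            # ok
print(check_status(True, "HEALTH_WARN"))  # warning
print(check_status(True, "HEALTH_ERR"))   # critical
```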

Troubleshooting

Need help? Contact Datadog support.

Further Reading