
Ceph

Agent Check

Supported OS: Linux, Mac OS


Overview

Enable the Datadog-Ceph integration to:

  • Track disk usage across storage pools
  • Receive service checks in case of issues
  • Monitor I/O performance metrics

Setup

Installation

The Ceph check is included in the Datadog Agent package, so you don’t need to install anything else on your Ceph servers.

Configuration

Edit the file ceph.d/conf.yaml in the conf.d/ folder at the root of your Agent’s configuration directory. See the sample ceph.d/conf.yaml for all available configuration options:

init_config:

instances:
  - ceph_cmd: /path/to/your/ceph # default is /usr/bin/ceph
    use_sudo: true               # only if the ceph binary needs sudo on your nodes

If you enabled use_sudo, add a line like the following to your sudoers file:

dd-agent ALL=(ALL) NOPASSWD:/path/to/your/ceph

Validation

Run the Agent’s status subcommand and look for ceph under the Checks section.
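
For example, on Agent v6 or later:

sudo datadog-agent status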

Data Collected

Metrics

| Metric | Type | Description | Unit |
|---|---|---|---|
| ceph.commit_latency_ms | gauge | Time taken to commit an operation to the journal | millisecond |
| ceph.apply_latency_ms | gauge | Time taken to flush an update to disks | millisecond |
| ceph.op_per_sec | gauge | I/O operations per second for a given pool | operation |
| ceph.read_bytes_sec | gauge | Bytes per second being read | byte |
| ceph.write_bytes_sec | gauge | Bytes per second being written | byte |
| ceph.num_osds | gauge | Number of known storage daemons | item |
| ceph.num_in_osds | gauge | Number of participating storage daemons | item |
| ceph.num_up_osds | gauge | Number of online storage daemons | item |
| ceph.num_pgs | gauge | Number of placement groups available | item |
| ceph.num_mons | gauge | Number of monitor daemons | item |
| ceph.aggregate_pct_used | gauge | Overall capacity usage | percent |
| ceph.total_objects | gauge | Object count from the underlying object store | item |
| ceph.num_objects | gauge | Object count for a given pool | item |
| ceph.read_bytes | rate | Per-pool read bytes | byte |
| ceph.write_bytes | rate | Per-pool write bytes | byte |
| ceph.num_pools | gauge | Number of pools | item |
| ceph.pgstate.active_clean | gauge | Number of active+clean placement groups | item |
| ceph.read_op_per_sec | gauge | Per-pool read operations per second | operation |
| ceph.write_op_per_sec | gauge | Per-pool write operations per second | operation |
| ceph.num_near_full_osds | gauge | Number of nearly full OSDs | item |
| ceph.num_full_osds | gauge | Number of full OSDs | item |
| ceph.osd.pct_used | gauge | Percentage used of full/near-full OSDs | percent |

Note: If you are running Ceph Luminous or later, you will not see the metric ceph.osd.pct_used.
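
Most of these metrics come from the JSON output of the ceph CLI, which is what the Agent check parses. As a rough illustration of where the numbers come from, here is a minimal Python sketch that reads the same data with ceph df detail. It assumes a Luminous-or-later JSON schema; field names vary across Ceph releases, so treat it as illustrative rather than version-proof:

#!/usr/bin/env python3
# Illustrative sketch only: pull the raw data behind a few of the metrics
# above from the ceph CLI's JSON output. Field names vary by Ceph release.
import json
import subprocess

def ceph_json(*args):
    """Run a ceph subcommand with JSON output and parse the result."""
    out = subprocess.check_output(["ceph", *args, "--format", "json"])
    return json.loads(out)

df = ceph_json("df", "detail")

# Cluster-wide usage, roughly what feeds ceph.aggregate_pct_used.
stats = df["stats"]
pct_used = 100.0 * stats["total_used_bytes"] / stats["total_bytes"]
print(f"cluster used: {pct_used:.1f}%")

# Per-pool object counts, roughly what feeds ceph.num_objects.
for pool in df["pools"]:
    print(pool["name"], pool["stats"].get("objects"))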

Events

The Ceph check does not include any events at this time.

Service Checks

  • ceph.overall_status : The Datadog Agent submits a service check for each of Ceph’s host health checks.

In addition to this service check, the Ceph check also collects a configurable list of health checks for Ceph Luminous and later. By default, these are:

  • ceph.osd_down : Returns OK if your OSDs are all up. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

  • ceph.osd_orphan : Returns OK if you have no orphan OSDs. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

  • ceph.osd_full : Returns OK if your OSDs are not full. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

  • ceph.osd_nearfull : Returns OK if your OSDs are not near full. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

  • ceph.pool_full : Returns OK if your pools have not reached their quota. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

  • ceph.pool_near_full : Returns OK if your pools are not near reaching their quota. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

  • ceph.pg_availability : Returns OK if there is full data availability. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

  • ceph.pg_degraded : Returns OK if there is full data redundancy. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

  • ceph.pg_degraded_full : Returns OK if there is enough space in the cluster for data redundancy. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

  • ceph.pg_damaged : Returns OK if there are no inconsistencies after data scrubbing. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

  • ceph.pg_not_scrubbed : Returns OK if the PGs were scrubbed recently. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

  • ceph.pg_not_deep_scrubbed : Returns OK if the PGs were deep scrubbed recently. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

  • ceph.cache_pool_near_full : Returns OK if the cache pools are not near full. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

  • ceph.too_few_pgs : Returns OK if the number of PGs is above the min threshold. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

  • ceph.too_many_pgs : Returns OK if the number of PGs is below the max threshold. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

  • ceph.object_unfound : Returns OK if all objects can be found. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

  • ceph.request_slow : Returns OK if requests are taking a normal time to process. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

  • ceph.request_stuck : Returns OK if requests are taking a normal time to process. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
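
The severity mapping above is mechanical: Ceph reports each Luminous-or-later health check (for example, OSD_DOWN) with a HEALTH_WARN or HEALTH_ERR severity, and that severity is translated into WARNING or CRITICAL. As a hedged illustration (not the Agent's actual implementation), a minimal Python sketch of the same mapping over ceph health detail output:

#!/usr/bin/env python3
# Illustrative sketch: map Ceph's Luminous+ health-check severities to the
# WARNING/CRITICAL statuses described above. Not the Agent's real code.
import json
import subprocess

SEVERITY_TO_STATUS = {
    "HEALTH_OK": "OK",
    "HEALTH_WARN": "WARNING",
    "HEALTH_ERR": "CRITICAL",
}

health = json.loads(
    subprocess.check_output(["ceph", "health", "detail", "--format", "json"])
)

# Each key is a Ceph health check (e.g. OSD_DOWN -> ceph.osd_down).
for name, check in health.get("checks", {}).items():
    status = SEVERITY_TO_STATUS.get(check["severity"], "CRITICAL")
    print(f"ceph.{name.lower()}: {status}")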

Troubleshooting

Need help? Contact Datadog Support.
