Ceph

Agent Check

[Image: Ceph Graph]

Overview

Enable the Datadog-Ceph integration to:

  • Track disk usage across storage pools
  • Receive service checks in case of issues
  • Monitor I/O performance metrics

Setup

Installation

The Ceph check is packaged with the Agent, so simply install the Agent on your Ceph servers.
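
If the Agent is not installed yet, use the command from the in-app installation instructions for your platform; for Agent v5 on Linux it was a one-liner of this form (replace the API key placeholder with your own):

DD_API_KEY=<YOUR_API_KEY> bash -c "$(curl -L https://raw.githubusercontent.com/DataDog/dd-agent/master/packaging/datadog-agent/source/install_agent.sh)"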

Configuration

Create a file ceph.yaml in the Agent’s conf.d directory. See the sample ceph.yaml for all available configuration options:

init_config:

instances:
  - ceph_cmd: /path/to/your/ceph # default is /usr/bin/ceph
    use_sudo: true               # only if the ceph binary needs sudo on your nodes
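
Like any Agent check, each instance also accepts an optional tags list; a minimal sketch (the tag values below are illustrative):

init_config:

instances:
  - ceph_cmd: /usr/bin/ceph
    use_sudo: false
    tags:
      - cluster:ceph-prod
      - env:production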

If you enabled use_sudo, add a line like the following to your sudoers file:

dd-agent ALL=(ALL) NOPASSWD:/path/to/your/ceph
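
Before restarting the Agent, you can confirm that the dd-agent user is able to execute the Ceph binary (the path below is a placeholder; substitute your actual ceph_cmd value):

sudo -u dd-agent /path/to/your/ceph status         # without use_sudo
sudo -u dd-agent sudo /path/to/your/ceph status    # with use_sudo; should not prompt for a password

Restart the Agent to begin sending Ceph metrics and service checks to Datadog.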

Validation

Run the Agent’s info subcommand and look for ceph under the Checks section:

  Checks
  ======
    [...]

    ceph (5.19.0)
    -------------
      - instance #0 [OK]
      - Collected 24 metrics, 0 events & 1 service check

    [...]
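
If you installed the Agent from the Linux packages, the info subcommand can be run as follows (Agent v5 shown; the service script location varies by platform and Agent version):

sudo /etc/init.d/datadog-agent info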

Data Collected

Metrics

Metric                      Type   Description                                          Unit
--------------------------  -----  ---------------------------------------------------  -----------
ceph.commit_latency_ms      gauge  Time taken to commit an operation to the journal     millisecond
ceph.apply_latency_ms       gauge  Time taken to flush an update to disks               millisecond
ceph.op_per_sec             gauge  I/O operations per second for a given pool           operation
ceph.read_bytes_sec         gauge  Bytes per second being read                          byte
ceph.write_bytes_sec        gauge  Bytes per second being written                       byte
ceph.num_osds               gauge  Number of known storage daemons                      item
ceph.num_in_osds            gauge  Number of participating storage daemons              item
ceph.num_up_osds            gauge  Number of online storage daemons                     item
ceph.num_pgs                gauge  Number of placement groups available                 item
ceph.num_mons               gauge  Number of monitor daemons                            item
ceph.aggregate_pct_used     gauge  Overall capacity usage                               percent
ceph.total_objects          gauge  Object count from the underlying object store        item
ceph.num_objects            gauge  Object count for a given pool                        item
ceph.read_bytes             rate   Per-pool read bytes                                  byte
ceph.write_bytes            rate   Per-pool write bytes                                 byte
ceph.num_pools              gauge  Number of pools                                      item
ceph.pgstate.active_clean   gauge  Number of active+clean placement groups              item
ceph.read_op_per_sec        gauge  Per-pool read operations per second                  operation
ceph.write_op_per_sec       gauge  Per-pool write operations per second                 operation
ceph.num_near_full_osds     gauge  Number of nearly full OSDs                           item
ceph.num_full_osds          gauge  Number of full OSDs                                  item
ceph.osd.pct_used           gauge  Percentage of space used on full or near-full OSDs  percent

Note: If you are running Ceph luminous or later, the ceph.osd.pct_used metric is not reported.
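
If you are unsure which release a node is running, the Ceph CLI reports it directly; for example:

ceph --version    # prints something like: ceph version 12.2.13 (...) luminous (stable)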

Events

The Ceph check does not include any events at this time.

Service Checks

  • ceph.overall_status : The Datadog Agent submits a service check for each of Ceph’s host health checks.

In addition to this service check, the Ceph check also collects a configurable list of health checks for Ceph luminous and later; a configuration sketch follows the list below. By default, these are:

  • ceph.osd_down : Returns OK if your OSDs are all up. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

  • ceph.osd_orphan : Returns OK if you have no orphan OSD. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

  • ceph.osd_full : Returns OK if your OSDs are not full. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

  • ceph.osd_nearfull : Returns OK if your OSDs are not near full. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

  • ceph.pool_full : Returns OK if your pools have not reached their quota. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

  • ceph.pool_near_full : Returns OK if your pools are not near reaching their quota. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

  • ceph.pg_availability : Returns OK if there is full data availability. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

  • ceph.pg_degraded : Returns OK if there is full data redundancy. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

  • ceph.pg_degraded_full : Returns OK if there is enough space in the cluster for data redundancy. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

  • ceph.pg_damaged : Returns OK if there are no inconsistencies after data scrubbing. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

  • ceph.pg_not_scrubbed : Returns OK if the PGs were scrubbed recently. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

  • ceph.pg_not_deep_scrubbed : Returns OK if the PGs were deep scrubbed recently. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

  • ceph.cache_pool_near_full : Returns OK if the cache pools are not near full. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

  • ceph.too_few_pgs : Returns OK if the number of PGs is above the min threshold. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

  • ceph.too_many_pgs : Returns OK if the number of PGs is below the max threshold. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

  • ceph.object_unfound : Returns OK if all objects can be found. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

  • ceph.request_slow : Returns OK if requests are taking a normal time to process. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

  • ceph.request_stuck : Returns OK if no requests have been stuck for an extended time. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
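
As a sketch of how that list can be tuned, the sample ceph.yaml exposes a collect_service_check_for option whose entries are the health check names without the ceph. prefix; the selection below is illustrative:

init_config:

instances:
  - ceph_cmd: /usr/bin/ceph
    collect_service_check_for:
      - osd_down
      - osd_full
      - pg_availability
      - pg_degraded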

Troubleshooting

Need help? Contact Datadog Support.

Further Reading