
Ceph

Agent Check

Supported OS: Linux, macOS

Ceph dashboard

Overview

Enable the Datadog-Ceph integration to:

  • Track disk usage across storage pools
  • Receive service checks in case of issues
  • Monitor I/O performance metrics

Setup

Follow the instructions below to install and configure the check when running the Agent on a host. See the Autodiscovery Integration Templates documentation to learn how to apply those instructions to a containerized environment.

Installation

The Ceph check is included in the Datadog Agent package, so you don’t need to install anything else on your Ceph servers.

Configuration

Edit the file ceph.d/conf.yaml in the conf.d/ folder at the root of your Agent’s configuration directory. See the sample ceph.d/conf.yaml for all available configuration options:

init_config:

instances:
  - ceph_cmd: /path/to/your/ceph # default is /usr/bin/ceph
    use_sudo: true               # only if the ceph binary needs sudo on your nodes

If you enabled use_sudo, add a line like the following to your sudoers file:

dd-agent ALL=(ALL) NOPASSWD:/path/to/your/ceph

Log Collection

To enable collecting logs in the Datadog Agent, update logs_enabled in datadog.yaml:

    logs_enabled: true

Next, edit ceph.d/conf.yaml by uncommenting the logs lines at the bottom. Update the logs path with the correct path to your Ceph log files.

logs:
 - type: file
   path: /var/log/ceph/*.log
   source: ceph
   service: <APPLICATION_NAME>

Validation

Run the Agent’s status subcommand and look for ceph under the Checks section.

Data Collected

Metrics

ceph.commit_latency_ms
(gauge)
Time taken to commit an operation to the journal
shown as millisecond
ceph.apply_latency_ms
(gauge)
Time taken to flush an update to disks
shown as millisecond
ceph.op_per_sec
(gauge)
IO operations per second for given pool
shown as operation
ceph.read_bytes_sec
(gauge)
Bytes/second being read
shown as byte
ceph.write_bytes_sec
(gauge)
Bytes/second being written
shown as byte
ceph.num_osds
(gauge)
Number of known storage daemons
shown as item
ceph.num_in_osds
(gauge)
Number of participating storage daemons
shown as item
ceph.num_up_osds
(gauge)
Number of online storage daemons
shown as item
ceph.num_pgs
(gauge)
Number of placement groups available
shown as item
ceph.num_mons
(gauge)
Number of monitor daemons
shown as item
ceph.aggregate_pct_used
(gauge)
Overall capacity usage metric
shown as percent
ceph.total_objects
(gauge)
Object count from the underlying object store
shown as item
ceph.num_objects
(gauge)
Object count for a given pool
shown as item
ceph.read_bytes
(rate)
Per-pool read bytes
shown as byte
ceph.write_bytes
(rate)
Per-pool write bytes
shown as byte
ceph.num_pools
(gauge)
Number of pools
shown as item
ceph.pgstate.active_clean
(gauge)
Number of active+clean placement groups
shown as item
ceph.read_op_per_sec
(gauge)
Per-pool read operations/second
shown as operation
ceph.write_op_per_sec
(gauge)
Per-pool write operations/second
shown as operation
ceph.num_near_full_osds
(gauge)
Number of nearly full OSDs
shown as item
ceph.num_full_osds
(gauge)
Number of full OSDs
shown as item
ceph.osd.pct_used
(gauge)
Percentage of space used on full or near-full OSDs
shown as percent

Note: If you are running Ceph Luminous or later, the ceph.osd.pct_used metric is not reported.
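Whether this gate applies can be determined from the output of ceph --version. The sketch below is illustrative only (the version strings are samples, and the parsing assumes the standard "ceph version X.Y.Z" prefix):

```python
import re

def is_luminous_or_later(version_output: str) -> bool:
    """Return True when the major version is 12 (Luminous) or newer,
    in which case the check does not report ceph.osd.pct_used."""
    match = re.search(r"ceph version (\d+)\.", version_output)
    if match is None:
        raise ValueError("unrecognized `ceph --version` output")
    return int(match.group(1)) >= 12

# Sample version strings for illustration:
print(is_luminous_or_later("ceph version 12.2.13 (abc) luminous (stable)"))  # True
print(is_luminous_or_later("ceph version 10.2.11 (abc) jewel (stable)"))     # False
```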

Events

The Ceph check does not include any events.

Service Checks

  • ceph.overall_status : The Datadog Agent submits a service check for each of Ceph’s host health checks.

In addition to this service check, the Ceph check also collects a configurable list of health checks for Ceph Luminous and later. By default, these are:

  • ceph.osd_down : Returns OK if your OSDs are all up. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

  • ceph.osd_orphan : Returns OK if you have no orphan OSDs. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

  • ceph.osd_full : Returns OK if your OSDs are not full. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

  • ceph.osd_nearfull : Returns OK if your OSDs are not near full. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

  • ceph.pool_full : Returns OK if your pools have not reached their quota. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

  • ceph.pool_near_full : Returns OK if your pools are not near reaching their quota. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

  • ceph.pg_availability : Returns OK if there is full data availability. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

  • ceph.pg_degraded : Returns OK if there is full data redundancy. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

  • ceph.pg_degraded_full : Returns OK if there is enough space in the cluster for data redundancy. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

  • ceph.pg_damaged : Returns OK if there are no inconsistencies after data scrubbing. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

  • ceph.pg_not_scrubbed : Returns OK if the PGs were scrubbed recently. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

  • ceph.pg_not_deep_scrubbed : Returns OK if the PGs were deep scrubbed recently. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

  • ceph.cache_pool_near_full : Returns OK if the cache pools are not near full. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

  • ceph.too_few_pgs : Returns OK if the number of PGs is above the min threshold. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

  • ceph.too_many_pgs : Returns OK if the number of PGs is below the max threshold. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

  • ceph.object_unfound : Returns OK if all objects can be found. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

  • ceph.request_slow : Returns OK if requests are taking a normal time to process. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

  • ceph.request_stuck : Returns OK if requests are taking a normal time to process. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
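Every health-based check above follows the same severity mapping. As a sketch of that rule (not the Agent's actual implementation; the numeric status values follow Datadog's usual 0=OK, 1=WARNING, 2=CRITICAL convention):

```python
# Conventional Datadog service-check status values
OK, WARNING, CRITICAL = 0, 1, 2

def status_for_check(raised: bool, severity: str) -> int:
    """Map a Ceph health check to a service-check status using the rule
    described above: OK when the check is not raised, WARNING when its
    severity is HEALTH_WARN, CRITICAL otherwise."""
    if not raised:
        return OK
    return WARNING if severity == "HEALTH_WARN" else CRITICAL

print(status_for_check(False, ""))             # 0 (OK)
print(status_for_check(True, "HEALTH_WARN"))   # 1 (WARNING)
print(status_for_check(True, "HEALTH_ERR"))    # 2 (CRITICAL)
```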

Troubleshooting

Need help? Contact Datadog support.
