Ceph

Supported OS: Linux, macOS

Integration version: 4.2.0

[Ceph dashboard]

Overview

Enable the Datadog-Ceph integration to:

  • Track disk usage across storage pools
  • Receive service checks in case of issues
  • Monitor I/O performance metrics

Minimum Agent version: 6.0.0

Setup

Installation

The Ceph check is included in the Datadog Agent package, so you don’t need to install anything else on your Ceph servers.

Configuration

Edit the file ceph.d/conf.yaml in the conf.d/ folder at the root of your Agent’s configuration directory. See the sample ceph.d/conf.yaml for all available configuration options:

init_config:

instances:
  - ceph_cmd: /path/to/your/ceph # default is /usr/bin/ceph
    use_sudo: true # only if the ceph binary needs sudo on your nodes

If you enabled use_sudo, add a line like the following to your sudoers file:

dd-agent ALL=(ALL) NOPASSWD:/path/to/your/ceph
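For example, assuming the Agent runs as the dd-agent user and the Ceph binary is at /usr/bin/ceph (adjust both to match your setup), you could keep the rule in its own drop-in file; the file name here is just an illustration:

# Add the sudoers rule in a dedicated drop-in file
echo "dd-agent ALL=(ALL) NOPASSWD:/usr/bin/ceph" | sudo tee /etc/sudoers.d/datadog-ceph

# Verify the syntax before relying on it
sudo visudo -c -f /etc/sudoers.d/datadog-ceph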

Log collection

Available for Agent versions >6.0

  1. Collecting logs is disabled by default in the Datadog Agent. Enable it in your datadog.yaml file:

    logs_enabled: true
    
  2. Next, edit ceph.d/conf.yaml by uncommenting the logs lines at the bottom. Update the logs path with the correct path to your Ceph log files.

    logs:
      - type: file
        path: /var/log/ceph/*.log
        source: ceph
        service: "<APPLICATION_NAME>"
    
  3. Restart the Agent.
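For example, on a systemd-based Linux host (the service name assumes a standard Agent v6+ package install):

sudo systemctl restart datadog-agent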

Validation

Run the Agent’s status subcommand and look for ceph under the Checks section.
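For example (sudo may or may not be required depending on how the Agent was installed):

# Show overall Agent status; ceph should appear under the Checks section
sudo datadog-agent status

# Alternatively, run the Ceph check once to see its metrics and any errors directly
sudo -u dd-agent datadog-agent check ceph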

Data Collected

Metrics

| Metric                        | Type  | Description                                                | Unit        |
|-------------------------------|-------|------------------------------------------------------------|-------------|
| ceph.aggregate_pct_used       | gauge | Overall capacity usage                                     | percent     |
| ceph.apply_latency_ms         | gauge | Time taken to flush an update to disks                     | millisecond |
| ceph.class_pct_used           | gauge | Per-class percentage of raw storage used                   | percent     |
| ceph.commit_latency_ms        | gauge | Time taken to commit an operation to the journal           | millisecond |
| ceph.misplaced_objects        | gauge | Number of objects misplaced                                | item        |
| ceph.misplaced_total          | gauge | Total number of objects if there are misplaced objects     | item        |
| ceph.num_full_osds            | gauge | Number of full OSDs                                        | item        |
| ceph.num_in_osds              | gauge | Number of participating storage daemons                    | item        |
| ceph.num_mons                 | gauge | Number of monitor daemons                                  | item        |
| ceph.num_near_full_osds       | gauge | Number of nearly full OSDs                                 | item        |
| ceph.num_objects              | gauge | Object count for a given pool                              | item        |
| ceph.num_osds                 | gauge | Number of known storage daemons                            | item        |
| ceph.num_pgs                  | gauge | Number of placement groups available                       | item        |
| ceph.num_pools                | gauge | Number of pools                                            | item        |
| ceph.num_up_osds              | gauge | Number of online storage daemons                           | item        |
| ceph.op_per_sec               | gauge | I/O operations per second for a given pool                 | operation   |
| ceph.osd.pct_used             | gauge | Percentage used of full/near-full OSDs                     | percent     |
| ceph.pgstate.active_clean     | gauge | Number of active+clean placement groups                    | item        |
| ceph.read_bytes               | gauge | Per-pool read bytes                                        | byte        |
| ceph.read_bytes_sec           | gauge | Bytes/second being read                                    | byte        |
| ceph.read_op_per_sec          | gauge | Per-pool read operations/second                            | operation   |
| ceph.recovery_bytes_per_sec   | gauge | Rate of recovered bytes                                    | byte        |
| ceph.recovery_keys_per_sec    | gauge | Rate of recovered keys                                     | item        |
| ceph.recovery_objects_per_sec | gauge | Rate of recovered objects                                  | item        |
| ceph.total_objects            | gauge | Object count from the underlying object store [v<=3 only]  | item        |
| ceph.write_bytes              | gauge | Per-pool write bytes                                       | byte        |
| ceph.write_bytes_sec          | gauge | Bytes/second being written                                 | byte        |
| ceph.write_op_per_sec         | gauge | Per-pool write operations/second                           | operation   |

Note: If you are running Ceph Luminous or later, the ceph.osd.pct_used metric is not included.
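As an illustration, you could build a Datadog monitor on one of these metrics; a query such as the following (the 80 percent threshold is an arbitrary example) would alert when overall capacity usage climbs too high:

avg(last_5m):avg:ceph.aggregate_pct_used{*} > 80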

Events

The Ceph check does not include any events.

Service Checks

ceph.overall_status

Returns OK if your Ceph cluster status is HEALTH_OK, WARNING if it’s HEALTH_WARN, CRITICAL otherwise.

Statuses: ok, warning, critical

ceph.osd_down

Returns OK if you have no down OSDs. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

Statuses: ok, warning, critical

ceph.osd_orphan

Returns OK if you have no orphan OSDs. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

Statuses: ok, warning, critical

ceph.osd_full

Returns OK if your OSDs are not full. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

Statuses: ok, warning, critical

ceph.osd_nearfull

Returns OK if your OSDs are not near full. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

Statuses: ok, warning, critical

ceph.pool_full

Returns OK if your pools have not reached their quota. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

Statuses: ok, warning, critical

ceph.pool_near_full

Returns OK if your pools are not close to reaching their quota. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

Statuses: ok, warning, critical

ceph.pg_availability

Returns OK if there is full data availability. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

Statuses: ok, warning, critical

ceph.pg_degraded

Returns OK if there is full data redundancy. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

Statuses: ok, warning, critical

ceph.pg_degraded_full

Returns OK if there is enough space in the cluster for data redundancy. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

Statuses: ok, warning, critical

ceph.pg_damaged

Returns OK if there are no inconsistencies after data scrubbing. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

Statuses: ok, warning, critical

ceph.pg_not_scrubbed

Returns OK if the PGs were scrubbed recently. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

Statuses: ok, warning, critical

ceph.pg_not_deep_scrubbed

Returns OK if the PGs were deep scrubbed recently. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

Statuses: ok, warning, critical

ceph.cache_pool_near_full

Returns OK if the cache pools are not near full. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

Statuses: ok, warning, critical

ceph.too_few_pgs

Returns OK if the number of PGs is above the min threshold. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

Statuses: ok, warning, critical

ceph.too_many_pgs

Returns OK if the number of PGs is below the max threshold. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

Statuses: ok, warning, critical

ceph.object_unfound

Returns OK if all objects can be found. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

Statuses: ok, warning, critical

ceph.request_slow

Returns OK if requests are taking a normal time to process. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

Statuses: ok, warning, critical

ceph.request_stuck

Returns OK if requests are taking a normal time to process. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

Statuses: ok, warning, critical

Troubleshooting

Need help? Contact Datadog support.

Further Reading

Additional helpful documentation, links, and articles: