Ceph

Supported OS Linux Mac OS

Integration version2.10.0

Ceph dashboard

개요

Datadog-Ceph 통합을 활성화해 다음 작업을 수행할 수 있습니다.

  • 스토리지 풀 전반의 디스크 사용량 추적
  • 문제 시 서비스 점검 수신
  • I/O 성능 메트릭 모니터링

설정

설치

Ceph 점검은 Datadog 에이전트 패키지에 포함되어 있으므로 Ceph 서버에서 아무 것도 설치할 필요가 없습니다.

설정

에이전트 설정 디렉터리 루트에 있는 conf.d/ 폴더에서 ceph.d/conf.yaml 파일을 편집합니다. 사용 가능한 모든 옵션은 sample ceph.d/conf.yaml을 참조하세요.

init_config:

instances:
  - ceph_cmd: /path/to/your/ceph # default is /usr/bin/ceph
    use_sudo: true # only if the ceph binary needs sudo on your nodes

use_sudo을 활성화하면 다음과 같은 라인을 sudoers 파일에 추가합니다.

dd-agent ALL=(ALL) NOPASSWD:/path/to/your/ceph

로그 수집

에이전트 버전 > 6.0 이상 사용 가능

  1. Datadog 에이전트에서 로그 수집은 기본적으로 사용하지 않도록 설정되어 있습니다. datadog.yaml 파일에서 로그 수집을 사용하도록 설정합니다.

    logs_enabled: true
    
  2. 다음으로 아래에서 logs 라인의 주석을 제거하여 ceph.d/conf.yaml을 편집합니다. Ceph 로그 파일에 대한 올바른 경로를 사용해 로그 path를 업데이트합니다.

    logs:
      - type: file
        path: /var/log/ceph/*.log
        source: ceph
        service: "<APPLICATION_NAME>"
    
  3. 에이전트를 재시작합니다.

검증

에이전트 상태 하위 명령을 실행하고 점검 섹션 아래에서 ceph를 찾으세요.

수집한 데이터

메트릭

ceph.aggregate_pct_used
(gauge)
Overall capacity usage metric
Shown as percent
ceph.apply_latency_ms
(gauge)
Time taken to flush an update to disks
Shown as millisecond
ceph.commit_latency_ms
(gauge)
Time taken to commit an operation to the journal
Shown as millisecond
ceph.misplaced_objects
(gauge)
Number of objects misplaced
Shown as item
ceph.misplaced_total
(gauge)
Total number of objects if there are misplaced objects
Shown as item
ceph.num_full_osds
(gauge)
Number of full osds
Shown as item
ceph.num_in_osds
(gauge)
Number of participating storage daemons
Shown as item
ceph.num_mons
(gauge)
Number of monitor daemons
Shown as item
ceph.num_near_full_osds
(gauge)
Number of nearly full osds
Shown as item
ceph.num_objects
(gauge)
Object count for a given pool
Shown as item
ceph.num_osds
(gauge)
Number of known storage daemons
Shown as item
ceph.num_pgs
(gauge)
Number of placement groups available
Shown as item
ceph.num_pools
(gauge)
Number of pools
Shown as item
ceph.num_up_osds
(gauge)
Number of online storage daemons
Shown as item
ceph.op_per_sec
(gauge)
IO operations per second for given pool
Shown as operation
ceph.osd.pct_used
(gauge)
Percentage used of full/near full osds
Shown as percent
ceph.pgstate.active_clean
(gauge)
Number of active+clean placement groups
Shown as item
ceph.read_bytes
(gauge)
Per-pool read bytes
Shown as byte
ceph.read_bytes_sec
(gauge)
Bytes/second being read
Shown as byte
ceph.read_op_per_sec
(gauge)
Per-pool read operations/second
Shown as operation
ceph.recovery_bytes_per_sec
(gauge)
Rate of recovered bytes
Shown as byte
ceph.recovery_keys_per_sec
(gauge)
Rate of recovered keys
Shown as item
ceph.recovery_objects_per_sec
(gauge)
Rate of recovered objects
Shown as item
ceph.total_objects
(gauge)
Object count from the underlying object store. [v<=3 only]
Shown as item
ceph.write_bytes
(gauge)
Per-pool write bytes
Shown as byte
ceph.write_bytes_sec
(gauge)
Bytes/second being written
Shown as byte
ceph.write_op_per_sec
(gauge)
Per-pool write operations/second
Shown as operation

참고: Ceph luminous 이상 버전을 실행 중인 경우 ceph.osd.pct_used 메트릭이 포함되지 않습니다.

이벤트

Ceph 점검은 이벤트를 포함하지 않습니다.

서비스 검사

ceph.overall_status
Returns OK if your ceph cluster status is HEALTH_OK, WARNING if it’s HEALTH_WARNING, CRITICAL otherwise.
Statuses: ok, warning, critical

ceph.osd_down
Returns OK if you have no down OSD. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical

ceph.osd_orphan
Returns OK if you have no orphan OSD. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical

ceph.osd_full
Returns OK if your OSDs are not full. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical

ceph.osd_nearfull
Returns OK if your OSDs are not near full. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical

ceph.pool_full
Returns OK if your pools have not reached their quota. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical

ceph.pool_near_full
Returns OK if your pools are not near reaching their quota. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical

ceph.pg_availability
Returns OK if there is full data availability. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical

ceph.pg_degraded
Returns OK if there is full data redundancy. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical

ceph.pg_degraded_full
Returns OK if there is enough space in the cluster for data redundancy. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical

ceph.pg_damaged
Returns OK if there are no inconsistencies after data scrubing. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical

ceph.pg_not_scrubbed
Returns OK if the PGs were scrubbed recently. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical

ceph.pg_not_deep_scrubbed
Returns OK if the PGs were deep scrubbed recently. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical

ceph.cache_pool_near_full
Returns OK if the cache pools are not near full. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical

ceph.too_few_pgs
Returns OK if the number of PGs is above the min threshold. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical

ceph.too_many_pgs
Returns OK if the number of PGs is below the max threshold. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical

ceph.object_unfound
Returns OK if all objects can be found. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical

ceph.request_slow
Returns OK requests are taking a normal time to process. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical

ceph.request_stuck
Returns OK requests are taking a normal time to process. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.
Statuses: ok, warning, critical

트러블슈팅

도움이 필요하신가요? Datadog 지원팀에 문의하세요.

참고 자료