Ceph

Supported OS: Linux, macOS

Integration version: 4.1.0

To find out if this integration is available in your organization, see your Datadog Integrations page or ask your organization administrator.

To initiate an exception request to enable this integration for your organization, email support@ddog-gov.com.

Ceph dashboard

Overview

Enable the Datadog-Ceph integration to:

  • Track disk usage across storage pools
  • Receive service checks in case of issues
  • Monitor I/O performance metrics

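Setup

Installation

The Ceph check is included in the Datadog Agent package, so you don't need to install anything else on your Ceph servers.

Configuration

Edit the ceph.d/conf.yaml file in the conf.d/ folder at the root of your Agent's configuration directory. See the sample ceph.d/conf.yaml for all available configuration options.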

init_config:

instances:
  - ceph_cmd: /path/to/your/ceph # default is /usr/bin/ceph
    use_sudo: true # only if the ceph binary needs sudo on your nodes

If you enabled use_sudo, add a line like the following to your sudoers file:

dd-agent ALL=(ALL) NOPASSWD:/path/to/your/ceph
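
To illustrate how ceph_cmd and use_sudo fit together, here is a minimal Python sketch of how a command line could be assembled from those two options. The helper name build_ceph_command and the status -f json arguments are assumptions chosen for illustration, not the check's actual implementation. The point is only that, with use_sudo enabled, the binary is invoked through sudo, which is why the dd-agent user needs the NOPASSWD entry for that exact path.

```python
from typing import List, Sequence


def build_ceph_command(ceph_cmd: str = "/usr/bin/ceph",
                       use_sudo: bool = False,
                       args: Sequence[str] = ("status", "-f", "json")) -> List[str]:
    """Assemble the argument list for a Ceph CLI call (hypothetical helper)."""
    command = [ceph_cmd, *args]
    if use_sudo:
        # Prefix with sudo; this only works non-interactively if the
        # sudoers file grants NOPASSWD for this exact binary path.
        command = ["sudo", *command]
    return command


if __name__ == "__main__":
    print(build_ceph_command("/path/to/your/ceph", use_sudo=True))
```

If the path in ceph_cmd and the path in the sudoers line differ, sudo would prompt for a password and a non-interactive call would fail, so it is worth keeping the two in sync.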

Log collection

Available for Agent versions 6.0 and later

  1. Collecting logs is disabled by default in the Datadog Agent. Enable it in your datadog.yaml file:

    logs_enabled: true
    
  2. Next, edit ceph.d/conf.yaml by uncommenting the logs lines at the bottom. Update the logs path with the correct path to your Ceph log files.

    logs:
      - type: file
        path: /var/log/ceph/*.log
        source: ceph
        service: "<APPLICATION_NAME>"
    
  3. Restart the Agent.
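
Before restarting, it can help to confirm that the configured path actually matches files and that they are readable. The following Python sketch is an illustration only and is not part of the Agent; the helper name readable_logs is hypothetical. It lists the files matching a glob and reports whether the current user can read each one, so it is most meaningful when run as the user the Agent runs as.

```python
import glob
import os
from typing import Dict


def readable_logs(pattern: str = "/var/log/ceph/*.log") -> Dict[str, bool]:
    """Map each file matching the glob to whether the current user can read it."""
    return {path: os.access(path, os.R_OK) for path in sorted(glob.glob(pattern))}


if __name__ == "__main__":
    matches = readable_logs()
    if not matches:
        print("No files match the configured path")
    for path, ok in matches.items():
        print(f"{path}: {'readable' if ok else 'NOT readable'}")
```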

Validation

Run the Agent's status subcommand and look for ceph under the Checks section.

Data Collected

Metrics

ceph.aggregate_pct_used
(gauge)
Overall capacity usage metric
Shown as percent
ceph.apply_latency_ms
(gauge)
Time taken to flush an update to disks
Shown as millisecond
ceph.class_pct_used
(gauge)
Per-class percentage of raw storage used
Shown as percent
ceph.commit_latency_ms
(gauge)
Time taken to commit an operation to the journal
Shown as millisecond
ceph.misplaced_objects
(gauge)
Number of objects misplaced
Shown as item
ceph.misplaced_total
(gauge)
Total number of objects if there are misplaced objects
Shown as item
ceph.num_full_osds
(gauge)
Number of full OSDs
Shown as item
ceph.num_in_osds
(gauge)
Number of participating storage daemons
Shown as item
ceph.num_mons
(gauge)
Number of monitor daemons
Shown as item
ceph.num_near_full_osds
(gauge)
Number of nearly full OSDs
Shown as item
ceph.num_objects
(gauge)
Object count for a given pool
Shown as item
ceph.num_osds
(gauge)
Number of known storage daemons
Shown as item
ceph.num_pgs
(gauge)
Number of placement groups available
Shown as item
ceph.num_pools
(gauge)
Number of pools
Shown as item
ceph.num_up_osds
(gauge)
Number of online storage daemons
Shown as item
ceph.op_per_sec
(gauge)
IO operations per second for given pool
Shown as operation
ceph.osd.pct_used
(gauge)
Percentage used of full/near full OSDs
Shown as percent
ceph.pgstate.active_clean
(gauge)
Number of active+clean placement groups
Shown as item
ceph.read_bytes
(gauge)
Per-pool read bytes
Shown as byte
ceph.read_bytes_sec
(gauge)
Bytes/second being read
Shown as byte
ceph.read_op_per_sec
(gauge)
Per-pool read operations/second
Shown as operation
ceph.recovery_bytes_per_sec
(gauge)
Rate of recovered bytes
Shown as byte
ceph.recovery_keys_per_sec
(gauge)
Rate of recovered keys
Shown as item
ceph.recovery_objects_per_sec
(gauge)
Rate of recovered objects
Shown as item
ceph.total_objects
(gauge)
Object count from the underlying object store. [v<=3 only]
Shown as item
ceph.write_bytes
(gauge)
Per-pool write bytes
Shown as byte
ceph.write_bytes_sec
(gauge)
Bytes/second being written
Shown as byte
ceph.write_op_per_sec
(gauge)
Per-pool write operations/second
Shown as operation

Note: If you are running Ceph Luminous or later, the ceph.osd.pct_used metric is not included.


Events

The Ceph check does not include any events.

Service Checks

ceph.overall_status

Returns OK if your Ceph cluster status is HEALTH_OK, WARNING if it’s HEALTH_WARN, CRITICAL otherwise.

Statuses: ok, warning, critical

ceph.osd_down

Returns OK if you have no down OSD. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

Statuses: ok, warning, critical

ceph.osd_orphan

Returns OK if you have no orphan OSD. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

Statuses: ok, warning, critical

ceph.osd_full

Returns OK if your OSDs are not full. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

Statuses: ok, warning, critical

ceph.osd_nearfull

Returns OK if your OSDs are not near full. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

Statuses: ok, warning, critical

ceph.pool_full

Returns OK if your pools have not reached their quota. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

Statuses: ok, warning, critical

ceph.pool_near_full

Returns OK if your pools are not near reaching their quota. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

Statuses: ok, warning, critical

ceph.pg_availability

Returns OK if there is full data availability. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

Statuses: ok, warning, critical

ceph.pg_degraded

Returns OK if there is full data redundancy. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

Statuses: ok, warning, critical

ceph.pg_degraded_full

Returns OK if there is enough space in the cluster for data redundancy. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

Statuses: ok, warning, critical

ceph.pg_damaged

Returns OK if there are no inconsistencies after data scrubbing. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

Statuses: ok, warning, critical

ceph.pg_not_scrubbed

Returns OK if the PGs were scrubbed recently. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

Statuses: ok, warning, critical

ceph.pg_not_deep_scrubbed

Returns OK if the PGs were deep scrubbed recently. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

Statuses: ok, warning, critical

ceph.cache_pool_near_full

Returns OK if the cache pools are not near full. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

Statuses: ok, warning, critical

ceph.too_few_pgs

Returns OK if the number of PGs is above the min threshold. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

Statuses: ok, warning, critical

ceph.too_many_pgs

Returns OK if the number of PGs is below the max threshold. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

Statuses: ok, warning, critical

ceph.object_unfound

Returns OK if all objects can be found. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

Statuses: ok, warning, critical

ceph.request_slow

Returns OK if requests are taking a normal time to process. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

Statuses: ok, warning, critical

ceph.request_stuck

Returns OK if requests are taking a normal time to process. Otherwise, returns WARNING if the severity is HEALTH_WARN, else CRITICAL.

Statuses: ok, warning, critical
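
The status rules described for these service checks can be summarized in a short Python sketch. It assumes a health report shaped like a top-level status plus a checks map whose entries carry a severity. That shape, the OSD_DOWN code in the usage example, and the function names are all assumptions made for illustration; this is not the check's real code.

```python
from typing import Any, Dict

OK, WARNING, CRITICAL = "ok", "warning", "critical"


def overall_status(health: Dict[str, Any]) -> str:
    """ceph.overall_status: OK for HEALTH_OK, WARNING for HEALTH_WARN, else CRITICAL."""
    status = health.get("status")
    if status == "HEALTH_OK":
        return OK
    if status == "HEALTH_WARN":
        return WARNING
    return CRITICAL


def individual_status(health: Dict[str, Any], code: str) -> str:
    """Per-condition checks: OK when the condition is not reported,
    WARNING when its severity is HEALTH_WARN, CRITICAL otherwise."""
    reported = health.get("checks", {}).get(code)
    if reported is None:
        return OK
    return WARNING if reported.get("severity") == "HEALTH_WARN" else CRITICAL


if __name__ == "__main__":
    sample = {  # hypothetical health report, not real cluster output
        "status": "HEALTH_WARN",
        "checks": {"OSD_DOWN": {"severity": "HEALTH_WARN"}},
    }
    print(overall_status(sample), individual_status(sample, "OSD_DOWN"))
```

Treating anything other than HEALTH_WARN as CRITICAL mirrors the "else CRITICAL" wording of the descriptions, so an unexpected severity value is escalated and never silently passes as OK.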

Troubleshooting

Need help? Contact Datadog support.

Further Reading