etcd

Supported OS: Linux, Mac OS, Windows

Integration version: 8.1.0

Etcd Dashboard

Overview

Collect Etcd metrics to:

  • Monitor the health of your Etcd cluster.
  • Know when host configurations may be out of sync.
  • Correlate the performance of Etcd with the rest of your applications.

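Setup

Installation

The Etcd check is included in the Datadog Agent package, so you don't need to install anything else on your Etcd instances.

Configuration

Host

To configure this check for an Agent running on a host:

Metric collection
  1. Edit the etcd.d/conf.yaml file, in the conf.d/ folder at the root of your Agent's configuration directory, to start collecting your Etcd performance data. See the sample etcd.d/conf.yaml for all available configuration options.
  2. Restart the Agent.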
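As a minimal sketch of step 1, an etcd.d/conf.yaml for a single local member could look like the following. The prometheus_url option is the same one used in the containerized instance configuration on this page. The localhost address and port 2379 are assumptions (2379 is etcd's default client port), so adjust them to wherever your member exposes /metrics.

```yaml
init_config:

instances:
  # Assumed endpoint: a local etcd member serving metrics on its default client port.
  - prometheus_url: http://localhost:2379/metrics
```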
Log collection
  1. Collecting logs is disabled by default in the Datadog Agent. Enable it in your datadog.yaml file:

    logs_enabled: true
    
  2. Uncomment and edit this configuration block at the bottom of your etcd.d/conf.yaml:

    logs:
      - type: file
        path: "<LOG_FILE_PATH>"
        source: etcd
        service: "<SERVICE_NAME>"
    

    Change the path and service parameter values based on your environment. See the sample etcd.d/conf.yaml for all available configuration options.

  3. Restart the Agent.
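Putting the metric and log steps together, a complete etcd.d/conf.yaml might look like the sketch below. The endpoint, log path, and service name are hypothetical values chosen for illustration. They are not defaults shipped with the Agent.

```yaml
init_config:

instances:
  # Assumed local metrics endpoint.
  - prometheus_url: http://localhost:2379/metrics

logs:
  - type: file
    # Hypothetical path; point this at the file etcd actually writes to.
    path: /var/log/etcd/etcd.log
    source: etcd
    # Hypothetical service name.
    service: etcd-cluster
```

The logs section only takes effect when logs_enabled: true is also set in datadog.yaml.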

Containerized

For containerized environments, see the Autodiscovery Integration Templates for guidance on applying the parameters below.

Metric collection
Parameter             Value
<INTEGRATION_NAME>    etcd
<INIT_CONFIG>         blank or {}
<INSTANCE_CONFIG>     {"prometheus_url": "http://%%host%%:2379/metrics"}
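On Kubernetes, one way to apply these parameters is through pod annotations. The sketch below assumes Autodiscovery's annotation format and a container named etcd. The pod name, container name, and image placeholder are hypothetical, and other orchestrators use different mechanisms (for example, Docker labels).

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: etcd
  annotations:
    # The "etcd" segment after the slash must match the container name below.
    ad.datadoghq.com/etcd.check_names: '["etcd"]'
    ad.datadoghq.com/etcd.init_configs: '[{}]'
    ad.datadoghq.com/etcd.instances: '[{"prometheus_url": "http://%%host%%:2379/metrics"}]'
spec:
  containers:
    - name: etcd
      image: <ETCD_IMAGE>
```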
Log collection

Collecting logs is disabled by default in the Datadog Agent. To enable it, see Kubernetes log collection.

Parameter       Value
<LOG_CONFIG>    {"source": "etcd", "service": "<SERVICE_NAME>"}
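As a sketch, the log configuration can be attached to a Kubernetes pod as an annotation. This assumes Autodiscovery's annotation format and a container named etcd (a hypothetical name). Keep <SERVICE_NAME> as a placeholder until you pick your own value.

```yaml
metadata:
  annotations:
    # The "etcd" segment after the slash is the assumed container name.
    ad.datadoghq.com/etcd.logs: '[{"source": "etcd", "service": "<SERVICE_NAME>"}]'
```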

Validation

Run the Agent's status subcommand and look for etcd under the Checks section.

Data Collected

Metrics

etcd.debugging.mvcc.db.compaction.keys.total
(count)
Total number of db keys compacted.
Shown as key
etcd.debugging.mvcc.db.compaction.pause.duration.milliseconds
(gauge)
Bucketed histogram of db compaction pause duration.
Shown as millisecond
etcd.debugging.mvcc.db.compaction.total.duration.milliseconds
(gauge)
Bucketed histogram of db compaction total duration.
Shown as millisecond
etcd.debugging.mvcc.db.total.size.in_bytes
(gauge)
Total size of the underlying database in bytes.
Shown as byte
etcd.debugging.mvcc.delete.total
(count)
Total number of deletes seen by this member.
Shown as query
etcd.debugging.mvcc.events.total
(count)
Total number of events sent by this member.
Shown as event
etcd.debugging.mvcc.index.compaction.pause.duration.milliseconds
(gauge)
Bucketed histogram of index compaction pause duration.
Shown as millisecond
etcd.debugging.mvcc.keys.total
(gauge)
Total number of keys.
Shown as key
etcd.debugging.mvcc.pending.events.total
(gauge)
Total number of pending events to be sent.
Shown as event
etcd.debugging.mvcc.put.total
(count)
Total number of puts seen by this member.
Shown as query
etcd.debugging.mvcc.range.total
(count)
Total number of ranges seen by this member.
Shown as query
etcd.debugging.mvcc.slow_watcher.total
(gauge)
Total number of unsynced slow watchers.
Shown as connection
etcd.debugging.mvcc.txn.total
(count)
Total number of txns seen by this member.
Shown as transaction
etcd.debugging.mvcc.watch_stream.total
(gauge)
Total number of watch streams.
Shown as connection
etcd.debugging.mvcc.watcher.total
(gauge)
Total number of watchers.
Shown as connection
etcd.debugging.server.lease.expired.total
(count)
The total number of expired leases.
Shown as item
etcd.debugging.snap.save.marshalling.duration.seconds
(gauge)
The marshalling cost distributions of save called by snapshot.
Shown as second
etcd.debugging.snap.save.total.duration.seconds
(gauge)
The total latency distributions of save called by snapshot.
Shown as second
etcd.debugging.store.expires.total
(count)
Total number of expired keys.
Shown as key
etcd.debugging.store.reads.total
(count)
Total number of reads action by (get/getRecursive), local to this member.
Shown as read
etcd.debugging.store.watch.requests.total
(count)
Total number of incoming watch requests (new or reestablished).
Shown as request
etcd.debugging.store.watchers
(gauge)
Count of currently active watchers.
Shown as connection
etcd.debugging.store.writes.total
(count)
Total number of writes (e.g. set/compareAndDelete) seen by this member.
Shown as write
etcd.disk.backend.commit.duration.seconds
(gauge)
The latency distributions of commit called by backend.
Shown as second
etcd.disk.backend.snapshot.duration.seconds
(gauge)
The latency distribution of backend snapshots.
Shown as second
etcd.disk.wal.fsync.duration.seconds.count
(count)
The count of latency distributions of fsync called by wal.
Shown as second
etcd.disk.wal.fsync.duration.seconds.sum
(gauge)
The sum of latency distributions of fsync called by wal.
Shown as second
etcd.disk.wal.write.bytes.total
(gauge)
Total number of bytes written in WAL
Shown as byte
etcd.etcd.server.client.requests.total
(count)
The total number of client requests per client version
Shown as request
etcd.go.gc.duration.seconds
(gauge)
A summary of the GC invocation durations.
Shown as second
etcd.go.goroutines
(gauge)
Number of goroutines that currently exist.
Shown as thread
etcd.go.info
(gauge)
Information about the Go environment.
Shown as item
etcd.go.memstats.alloc.bytes
(gauge)
Number of bytes allocated and still in use.
Shown as byte
etcd.go.memstats.alloc.bytes.total
(count)
Total number of bytes allocated, even if freed.
Shown as byte
etcd.go.memstats.buck.hash.sys.bytes
(gauge)
Number of bytes used by the profiling bucket hash table.
Shown as byte
etcd.go.memstats.frees.total
(count)
Total number of frees.
Shown as occurrence
etcd.go.memstats.gc.cpu.fraction
(gauge)
The fraction of this program's available CPU time used by the GC since the program started.
Shown as cpu
etcd.go.memstats.gc.sys.bytes
(gauge)
Number of bytes used for garbage collection system metadata.
Shown as byte
etcd.go.memstats.heap.alloc.bytes
(gauge)
Number of heap bytes allocated and still in use.
Shown as byte
etcd.go.memstats.heap.idle.bytes
(gauge)
Number of heap bytes waiting to be used.
Shown as byte
etcd.go.memstats.heap.inuse.bytes
(gauge)
Number of heap bytes that are in use.
Shown as byte
etcd.go.memstats.heap.objects
(gauge)
Number of allocated objects.
Shown as item
etcd.go.memstats.heap.released.bytes
(gauge)
Number of heap bytes released to OS.
Shown as byte
etcd.go.memstats.heap.sys.bytes
(gauge)
Number of heap bytes obtained from system.
Shown as byte
etcd.go.memstats.last.gc.time.seconds
(gauge)
Number of seconds since 1970 of last garbage collection.
Shown as second
etcd.go.memstats.lookups.total
(count)
Total number of pointer lookups.
Shown as occurrence
etcd.go.memstats.mallocs.total
(count)
Total number of mallocs.
Shown as occurrence
etcd.go.memstats.mcache.inuse.bytes
(gauge)
Number of bytes in use by mcache structures.
Shown as byte
etcd.go.memstats.mcache.sys.bytes
(gauge)
Number of bytes used for mcache structures obtained from system.
Shown as byte
etcd.go.memstats.mspan.inuse.bytes
(gauge)
Number of bytes in use by mspan structures.
Shown as byte
etcd.go.memstats.mspan.sys.bytes
(gauge)
Number of bytes used for mspan structures obtained from system.
Shown as byte
etcd.go.memstats.next.gc.bytes
(gauge)
Number of heap bytes when next garbage collection will take place.
Shown as byte
etcd.go.memstats.other.sys.bytes
(gauge)
Number of bytes used for other system allocations.
Shown as byte
etcd.go.memstats.stack.inuse.bytes
(gauge)
Number of bytes in use by the stack allocator.
Shown as byte
etcd.go.memstats.stack.sys.bytes
(gauge)
Number of bytes obtained from system for stack allocator.
Shown as byte
etcd.go.memstats.sys.bytes
(gauge)
Number of bytes obtained from system.
Shown as byte
etcd.go.threads
(gauge)
Number of OS threads created.
Shown as thread
etcd.grpc.proxy.cache.hits.total
(gauge)
Total number of cache hits
Shown as occurrence
etcd.grpc.proxy.cache.keys.total
(gauge)
Total number of keys/ranges cached
Shown as item
etcd.grpc.proxy.cache.misses.total
(gauge)
Total number of cache misses
Shown as occurrence
etcd.grpc.proxy.events.coalescing.total
(count)
Total number of events coalescing
Shown as event
etcd.grpc.proxy.watchers.coalescing.total
(gauge)
Total number of current watchers coalescing
Shown as connection
etcd.grpc.server.handled.total
(count)
Total number of RPCs completed on the server, regardless of success or failure.
Shown as operation
etcd.grpc.server.msg.received.total
(count)
Total number of RPC stream messages received on the server.
Shown as operation
etcd.grpc.server.msg.sent.total
(count)
Total number of gRPC stream messages sent by the server.
Shown as operation
etcd.grpc.server.started.total
(count)
Total number of RPCs started on the server.
Shown as operation
etcd.leader.counts.fail
(gauge)
Rate of failed Raft RPC requests (ETCD API V2 only)
Shown as request
etcd.leader.counts.success
(gauge)
Rate of successful Raft RPC requests (ETCD API V2 only)
Shown as request
etcd.leader.latency.avg
(gauge)
Average latency to each peer in the cluster (ETCD API V2 only)
Shown as millisecond
etcd.leader.latency.current
(gauge)
Current latency to each peer in the cluster (ETCD API V2 only)
Shown as millisecond
etcd.leader.latency.max
(gauge)
Maximum latency to each peer in the cluster (ETCD API V2 only)
Shown as millisecond
etcd.leader.latency.min
(gauge)
Minimum latency to each peer in the cluster (ETCD API V2 only)
Shown as millisecond
etcd.leader.latency.stddev
(gauge)
Standard deviation of latency to each peer in the cluster (ETCD API V2 only)
Shown as millisecond
etcd.mvcc.db.total.size.in_use.bytes
(gauge)
Total size of the underlying database logically in use
Shown as byte
etcd.network.active_peers
(gauge)
The current number of active peer connections
Shown as connection
etcd.network.client.grpc.received.bytes.total
(count)
The total number of bytes received from grpc clients.
Shown as byte
etcd.network.client.grpc.sent.bytes.total
(count)
The total number of bytes sent to grpc clients.
Shown as byte
etcd.network.disconnected_peers.total
(count)
The total number of disconnected peers
Shown as connection
etcd.network.peer.received.bytes.total
(count)
The total number of bytes received from peers.
Shown as byte
etcd.network.peer.received.failures.total
(count)
The total number of receive failures from peers
Shown as event
etcd.network.peer.round_trip_time.seconds
(gauge)
Round-Trip-Time histogram between peers.
Shown as second
etcd.network.peer.sent.bytes.total
(count)
The total number of bytes sent to peers.
Shown as byte
etcd.network.peer.sent.failures.total
(count)
The total number of send failures from peers
Shown as event
etcd.network.snapshot.receive.failures.total
(count)
Total number of snapshot receive failures
Shown as event
etcd.network.snapshot.receive.inflights.total
(gauge)
Total number of inflight snapshot receives
Shown as event
etcd.network.snapshot.receive.success.total
(count)
Total number of successful snapshot receives
Shown as event
etcd.network.snapshot.receive.total.duration.seconds.count
(gauge)
Total latency distributions of v3 snapshot receives
Shown as second
etcd.network.snapshot.receive.total.duration.seconds.sum
(gauge)
Total latency distributions of v3 snapshot receives
Shown as second
etcd.network.snapshot.send.failures.total
(count)
Total number of snapshot send failures
Shown as event
etcd.network.snapshot.send.inflights.total
(gauge)
Total number of inflight snapshot sends
Shown as event
etcd.network.snapshot.send.sucess.total
(count)
Total number of successful snapshot sends
Shown as event
etcd.network.snapshot.send.total.duration.seconds.count
(gauge)
Total latency distributions of v3 snapshot sends
Shown as second
etcd.network.snapshot.send.total.duration.seconds.sum
(gauge)
Total latency distributions of v3 snapshot sends
Shown as second
etcd.os.fd.limit
(gauge)
The file descriptor limit
Shown as object
etcd.os.fd.used
(gauge)
The number of used file descriptors
Shown as object
etcd.process.cpu.seconds.total
(count)
Total user and system CPU time spent in seconds.
Shown as cpu
etcd.process.max.fds
(gauge)
Maximum number of open file descriptors.
Shown as item
etcd.process.open.fds
(gauge)
Number of open file descriptors.
Shown as item
etcd.process.resident.memory.bytes
(gauge)
Resident memory size in bytes.
Shown as byte
etcd.process.start.time.seconds
(gauge)
Start time of the process since unix epoch in seconds.
Shown as second
etcd.process.virtual.memory.bytes
(gauge)
Virtual memory size in bytes.
Shown as byte
etcd.self.recv.appendrequest.count
(gauge)
Rate of append requests this node has processed (ETCD API V2 only)
Shown as request
etcd.self.recv.bandwidthrate
(gauge)
Rate of bytes received (ETCD API V2 only)
Shown as byte
etcd.self.recv.pkgrate
(gauge)
Rate of packets received (ETCD API V2 only)
Shown as packet
etcd.self.send.appendrequest.count
(gauge)
Rate of append requests this node has sent (ETCD API V2 only)
Shown as request
etcd.self.send.bandwidthrate
(gauge)
Rate of bytes sent (ETCD API V2 only)
Shown as byte
etcd.self.send.pkgrate
(gauge)
Rate of packets sent (ETCD API V2 only)
Shown as packet
etcd.server.apply.slow.total
(count)
The total number of slow apply requests (likely overloaded from slow disk)
Shown as request
etcd.server.go_version
(gauge)
Which Go version the server is running with. The value is 1, with a label carrying the current version.
Shown as unit
etcd.server.has_leader
(gauge)
Whether or not a leader exists. 1 if a leader exists, 0 otherwise.
Shown as check
etcd.server.health.failures.total
(count)
The total number of failed health checks
Shown as event
etcd.server.health.success.total
(count)
The total number of successful health checks
Shown as event
etcd.server.heartbeat.send.failures.total
(count)
The total number of leader heartbeat send failures (likely overloaded from slow disk)
Shown as event
etcd.server.is_leader
(gauge)
Whether or not this member is a leader. 1 if is, 0 otherwise.
Shown as check
etcd.server.leader.changes.seen.total
(count)
The number of leader changes seen.
Shown as event
etcd.server.lease.expired.total
(count)
The total number of expired leases
Shown as occurrence
etcd.server.proposals.applied.total
(gauge)
The total number of consensus proposals applied.
Shown as occurrence
etcd.server.proposals.committed.total
(gauge)
The total number of consensus proposals committed.
Shown as occurrence
etcd.server.proposals.failed.total
(count)
The total number of failed proposals seen.
Shown as occurrence
etcd.server.proposals.pending
(gauge)
The current number of pending proposals to commit.
Shown as occurrence
etcd.server.quota.backend.bytes
(gauge)
Current backend storage quota size in bytes
Shown as byte
etcd.server.read_indexes.failed.total
(count)
The total number of failed read indexes seen
Shown as event
etcd.server.read_indexes.slow.total
(count)
The total number of pending read indexes not in sync with leader or timed out read index requests
Shown as event
etcd.server.version
(gauge)
Which version is running. 1 for 'server_version' label with current version.
Shown as item
etcd.snap.db.fsync.duration.seconds.count
(gauge)
The latency distributions of fsyncing .snap.db file
Shown as second
etcd.snap.db.fsync.duration.seconds.sum
(gauge)
The latency distributions of fsyncing .snap.db file
Shown as second
etcd.snap.db.save.total.duration.seconds.count
(gauge)
The total latency distributions of v3 snapshot save
Shown as second
etcd.snap.db.save.total.duration.seconds.sum
(gauge)
The total latency distributions of v3 snapshot save
Shown as second
etcd.snap.fsync.duration.seconds.count
(gauge)
The latency distributions of fsync called by snap
Shown as second
etcd.snap.fsync.duration.seconds.sum
(gauge)
The latency distributions of fsync called by snap
Shown as second
etcd.store.compareanddelete.fail
(gauge)
Rate of failed compare and delete requests (ETCD API V2 only)
Shown as request
etcd.store.compareanddelete.success
(gauge)
Rate of successful compare and delete requests (ETCD API V2 only)
Shown as request
etcd.store.compareandswap.fail
(gauge)
Rate of failed compare and swap requests (ETCD API V2 only)
Shown as request
etcd.store.compareandswap.success
(gauge)
Rate of successful compare and swap requests (ETCD API V2 only)
Shown as request
etcd.store.create.fail
(gauge)
Rate of failed create requests (ETCD API V2 only)
Shown as request
etcd.store.create.success
(gauge)
Rate of successful create requests (ETCD API V2 only)
Shown as request
etcd.store.delete.fail
(gauge)
Rate of failed delete requests (ETCD API V2 only)
Shown as request
etcd.store.delete.success
(gauge)
Rate of successful delete requests (ETCD API V2 only)
Shown as request
etcd.store.expire.count
(gauge)
Rate of expired keys (ETCD API V2 only)
Shown as eviction
etcd.store.gets.fail
(gauge)
Rate of failed get requests (ETCD API V2 only)
Shown as request
etcd.store.gets.success
(gauge)
Rate of successful get requests (ETCD API V2 only)
Shown as request
etcd.store.sets.fail
(gauge)
Rate of failed set requests (ETCD API V2 only)
Shown as request
etcd.store.sets.success
(gauge)
Rate of successful set requests (ETCD API V2 only)
Shown as request
etcd.store.update.fail
(gauge)
Rate of failed update requests (ETCD API V2 only)
Shown as request
etcd.store.update.success
(gauge)
Rate of successful update requests (ETCD API V2 only)
Shown as request
etcd.store.watchers
(gauge)
Rate of watchers (ETCD API V2 only)

Etcd metrics are tagged with etcd_state:leader or etcd_state:follower, depending on the node status, so you can easily aggregate metrics by status.

Events

The Etcd check does not include any events.

Service Checks

etcd.can_connect
Returns CRITICAL if unable to get metrics from etcd (timeout or non-200 HTTP code). This service check is only available on the legacy version of the etcd check.
Statuses: ok, critical

etcd.healthy
Returns CRITICAL when a member is unhealthy. This service check is only available on the legacy version of the etcd check.
Statuses: ok, critical, unknown

etcd.prometheus.health
Returns CRITICAL if the check cannot access a metrics endpoint. Otherwise, returns OK. This service check is only available when use_preview is enabled.
Statuses: ok, critical

Troubleshooting

Need help? Contact Datadog support.

Further Reading