etcd

Agent Check

Supported OS: Linux, Mac OS, Windows

Etcd Dashboard

Overview

Collect etcd metrics to:

  • Monitor the health of your etcd cluster.
  • Know when host configurations may be out of sync.
  • Correlate the performance of etcd with the rest of your applications.

Setup

Installation

The etcd check is included in the Datadog Agent package, so you don’t need to install anything else on your etcd instance(s).

Configuration

  1. To start collecting your etcd performance data, edit the etcd.d/conf.yaml file in the conf.d/ folder at the root of your Agent’s configuration directory. See the sample etcd.d/conf.yaml for all available configuration options.

    init_config:
    
    instances:
        - url: "https://server:port" # API endpoint of your etcd instance
  2. Restart the Agent.
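
Before restarting the Agent, it can help to confirm that the url value is well formed, since the sample value "https://server:port" is only a placeholder. The following is a minimal sketch using Python's standard library; the function name check_instance_url is hypothetical and is not part of the Agent or the etcd check.

```python
from urllib.parse import urlsplit


def check_instance_url(url):
    """Return a list of problems found in an instance `url` value.

    An empty list means the value looks like a usable http(s) endpoint.
    This is an illustrative helper, not something the Agent provides.
    """
    try:
        parts = urlsplit(url)
        port = parts.port  # raises ValueError when the port is not numeric
    except ValueError:
        return ["invalid host or port in %r" % url]

    problems = []
    if parts.scheme not in ("http", "https"):
        problems.append("scheme should be http or https")
    if not parts.hostname:
        problems.append("missing host")
    if port is None:
        problems.append("missing port")
    return problems


if __name__ == "__main__":
    # The unedited placeholder from the sample configuration is rejected.
    print(check_instance_url("https://server:port"))
    # A concrete endpoint (host and port assumed for illustration) passes.
    print(check_instance_url("https://localhost:2379"))
```

The host localhost and port 2379 above are assumptions for illustration; substitute the API endpoint of your own etcd instance.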

Validation

Run the Agent’s status subcommand and look for etcd under the Checks section.

Data Collected

Metrics

etcd.debugging.mvcc.db.compaction.keys.total
(count)
Total number of db keys compacted.
shown as key
etcd.debugging.mvcc.db.compaction.pause.duration.milliseconds
(gauge)
Bucketed histogram of db compaction pause duration.
shown as millisecond
etcd.debugging.mvcc.db.compaction.total.duration.milliseconds
(gauge)
Bucketed histogram of db compaction total duration.
shown as millisecond
etcd.debugging.mvcc.db.total.size.in_bytes
(gauge)
Total size of the underlying database in bytes.
shown as byte
etcd.debugging.mvcc.delete.total
(count)
Total number of deletes seen by this member.
shown as query
etcd.debugging.mvcc.events.total
(count)
Total number of events sent by this member.
shown as event
etcd.debugging.mvcc.index.compaction.pause.duration.milliseconds
(gauge)
Bucketed histogram of index compaction pause duration.
shown as millisecond
etcd.debugging.mvcc.keys.total
(gauge)
Total number of keys.
shown as key
etcd.debugging.mvcc.pending.events.total
(gauge)
Total number of pending events to be sent.
shown as event
etcd.debugging.mvcc.put.total
(count)
Total number of puts seen by this member.
shown as query
etcd.debugging.mvcc.range.total
(count)
Total number of ranges seen by this member.
shown as query
etcd.debugging.mvcc.slow_watcher.total
(gauge)
Total number of unsynced slow watchers.
shown as connection
etcd.debugging.mvcc.txn.total
(count)
Total number of txns seen by this member.
shown as transaction
etcd.debugging.mvcc.watch_stream.total
(gauge)
Total number of watch streams.
shown as connection
etcd.debugging.mvcc.watcher.total
(gauge)
Total number of watchers.
shown as connection
etcd.debugging.server.lease.expired.total
(count)
The total number of expired leases.
shown as item
etcd.debugging.snap.save.marshalling.duration.seconds
(gauge)
The marshalling cost distributions of save called by snapshot.
shown as second
etcd.debugging.snap.save.total.duration.seconds
(gauge)
The total latency distributions of save called by snapshot.
shown as second
etcd.debugging.store.expires.total
(count)
Total number of expired keys.
shown as key
etcd.debugging.store.reads.total
(count)
Total number of read actions (get/getRecursive), local to this member.
shown as read
etcd.debugging.store.watch.requests.total
(count)
Total number of incoming watch requests (new or reestablished).
shown as request
etcd.debugging.store.watchers
(gauge)
Count of currently active watchers.
shown as connection
etcd.debugging.store.writes.total
(count)
Total number of writes (e.g. set/compareAndDelete) seen by this member.
shown as write
etcd.disk.backend.commit.duration.seconds
(gauge)
The latency distributions of commit called by backend.
shown as second
etcd.disk.backend.snapshot.duration.seconds
(gauge)
The latency distribution of backend snapshots.
shown as second
etcd.disk.wal.fsync.duration.seconds
(gauge)
The latency distributions of fsync called by wal.
shown as second
etcd.grpc.proxy.cache.hits.total
(gauge)
Total number of cache hits
shown as occurrence
etcd.grpc.proxy.cache.keys.total
(gauge)
Total number of keys/ranges cached
shown as item
etcd.grpc.proxy.cache.misses.total
(gauge)
Total number of cache misses
shown as occurrence
etcd.grpc.proxy.events.coalescing.total
(count)
Total number of events coalescing
shown as event
etcd.grpc.proxy.watchers.coalescing.total
(gauge)
Total number of current watchers coalescing
shown as connection
etcd.network.client.grpc.received.bytes.total
(count)
The total number of bytes received from grpc clients.
shown as byte
etcd.network.client.grpc.sent.bytes.total
(count)
The total number of bytes sent to grpc clients.
shown as byte
etcd.network.peer.received.bytes.total
(count)
The total number of bytes received from peers.
shown as byte
etcd.network.peer.round_trip_time.seconds
(gauge)
Round-Trip-Time histogram between peers.
shown as second
etcd.network.peer.sent.bytes.total
(count)
The total number of bytes sent to peers.
shown as byte
etcd.server.has_leader
(gauge)
Whether or not a leader exists. 1 if a leader exists, 0 otherwise.
shown as check
etcd.server.is_leader
(gauge)
Whether or not this member is a leader. 1 if it is, 0 otherwise.
shown as check
etcd.server.leader.changes.seen.total
(count)
The number of leader changes seen.
shown as event
etcd.server.proposals.applied.total
(gauge)
The total number of consensus proposals applied.
shown as occurrence
etcd.server.proposals.committed.total
(gauge)
The total number of consensus proposals committed.
shown as occurrence
etcd.server.proposals.failed.total
(count)
The total number of failed proposals seen.
shown as occurrence
etcd.server.proposals.pending
(gauge)
The current number of pending proposals to commit.
shown as occurrence
etcd.server.version
(gauge)
Which version is running. 1 for 'server_version' label with current version.
shown as item
etcd.go.gc.duration.seconds
(gauge)
A summary of the GC invocation durations.
shown as second
etcd.go.goroutines
(gauge)
Number of goroutines that currently exist.
shown as thread
etcd.go.info
(gauge)
Information about the Go environment.
shown as item
etcd.go.memstats.alloc.bytes
(gauge)
Number of bytes allocated and still in use.
shown as byte
etcd.go.memstats.alloc.bytes.total
(count)
Total number of bytes allocated, even if freed.
shown as byte
etcd.go.memstats.buck.hash.sys.bytes
(gauge)
Number of bytes used by the profiling bucket hash table.
shown as byte
etcd.go.memstats.frees.total
(count)
Total number of frees.
shown as occurrence
etcd.go.memstats.gc.cpu.fraction
(gauge)
The fraction of this program's available CPU time used by the GC since the program started.
shown as cpu
etcd.go.memstats.gc.sys.bytes
(gauge)
Number of bytes used for garbage collection system metadata.
shown as byte
etcd.go.memstats.heap.alloc.bytes
(gauge)
Number of heap bytes allocated and still in use.
shown as byte
etcd.go.memstats.heap.idle.bytes
(gauge)
Number of heap bytes waiting to be used.
shown as byte
etcd.go.memstats.heap.inuse.bytes
(gauge)
Number of heap bytes that are in use.
shown as byte
etcd.go.memstats.heap.objects
(gauge)
Number of allocated objects.
shown as item
etcd.go.memstats.heap.released.bytes
(gauge)
Number of heap bytes released to OS.
shown as byte
etcd.go.memstats.heap.sys.bytes
(gauge)
Number of heap bytes obtained from system.
shown as byte
etcd.go.memstats.last.gc.time.seconds
(gauge)
Number of seconds since 1970 of last garbage collection.
shown as second
etcd.go.memstats.lookups.total
(count)
Total number of pointer lookups.
shown as occurrence
etcd.go.memstats.mallocs.total
(count)
Total number of mallocs.
shown as occurrence
etcd.go.memstats.mcache.inuse.bytes
(gauge)
Number of bytes in use by mcache structures.
shown as byte
etcd.go.memstats.mcache.sys.bytes
(gauge)
Number of bytes used for mcache structures obtained from system.
shown as byte
etcd.go.memstats.mspan.inuse.bytes
(gauge)
Number of bytes in use by mspan structures.
shown as byte
etcd.go.memstats.mspan.sys.bytes
(gauge)
Number of bytes used for mspan structures obtained from system.
shown as byte
etcd.go.memstats.next.gc.bytes
(gauge)
Number of heap bytes when next garbage collection will take place.
shown as byte
etcd.go.memstats.other.sys.bytes
(gauge)
Number of bytes used for other system allocations.
shown as byte
etcd.go.memstats.stack.inuse.bytes
(gauge)
Number of bytes in use by the stack allocator.
shown as byte
etcd.go.memstats.stack.sys.bytes
(gauge)
Number of bytes obtained from system for stack allocator.
shown as byte
etcd.go.memstats.sys.bytes
(gauge)
Number of bytes obtained from system.
shown as byte
etcd.go.threads
(gauge)
Number of OS threads created.
shown as thread
etcd.grpc.server.handled.total
(count)
Total number of RPCs completed on the server, regardless of success or failure.
shown as operation
etcd.grpc.server.msg.received.total
(count)
Total number of RPC stream messages received on the server.
shown as operation
etcd.grpc.server.msg.sent.total
(count)
Total number of gRPC stream messages sent by the server.
shown as operation
etcd.grpc.server.started.total
(count)
Total number of RPCs started on the server.
shown as operation
etcd.process.cpu.seconds.total
(count)
Total user and system CPU time spent in seconds.
shown as cpu
etcd.process.max.fds
(gauge)
Maximum number of open file descriptors.
shown as item
etcd.process.open.fds
(gauge)
Number of open file descriptors.
shown as item
etcd.process.resident.memory.bytes
(gauge)
Resident memory size in bytes.
shown as byte
etcd.process.start.time.seconds
(gauge)
Start time of the process since unix epoch in seconds.
shown as second
etcd.process.virtual.memory.bytes
(gauge)
Virtual memory size in bytes.
shown as byte
etcd.store.gets.success
(gauge)
Rate of successful get requests
shown as request
etcd.store.gets.fail
(gauge)
Rate of failed get requests
shown as request
etcd.store.sets.success
(gauge)
Rate of successful set requests
shown as request
etcd.store.sets.fail
(gauge)
Rate of failed set requests
shown as request
etcd.store.delete.success
(gauge)
Rate of successful delete requests
shown as request
etcd.store.delete.fail
(gauge)
Rate of failed delete requests
shown as request
etcd.store.update.success
(gauge)
Rate of successful update requests
shown as request
etcd.store.update.fail
(gauge)
Rate of failed update requests
shown as request
etcd.store.create.success
(gauge)
Rate of successful create requests
shown as request
etcd.store.create.fail
(gauge)
Rate of failed create requests
shown as request
etcd.store.compareandswap.success
(gauge)
Rate of successful compare-and-swap requests
shown as request
etcd.store.compareandswap.fail
(gauge)
Rate of failed compare-and-swap requests
shown as request
etcd.store.compareanddelete.success
(gauge)
Rate of successful compare-and-delete requests
shown as request
etcd.store.compareanddelete.fail
(gauge)
Rate of failed compare-and-delete requests
shown as request
etcd.store.expire.count
(gauge)
Rate of expired keys
shown as eviction
etcd.store.watchers
(gauge)
Rate of watchers
etcd.self.send.pkgrate
(gauge)
Rate of packets sent
shown as packet
etcd.self.send.bandwidthrate
(gauge)
Rate of bytes sent
shown as byte
etcd.self.recv.pkgrate
(gauge)
Rate of packets received
shown as packet
etcd.self.recv.bandwidthrate
(gauge)
Rate of bytes received
shown as byte
etcd.self.recv.appendrequest.count
(gauge)
Rate of append requests this node has processed
shown as request
etcd.self.send.appendrequest.count
(gauge)
Rate of append requests this node has sent
shown as request
etcd.leader.counts.fail
(gauge)
Rate of failed Raft RPC requests
shown as request
etcd.leader.counts.success
(gauge)
Rate of successful Raft RPC requests
shown as request
etcd.leader.latency.current
(gauge)
Current latency to each peer in the cluster
shown as millisecond
etcd.leader.latency.avg
(gauge)
Average latency to each peer in the cluster
shown as millisecond
etcd.leader.latency.min
(gauge)
Minimum latency to each peer in the cluster
shown as millisecond
etcd.leader.latency.max
(gauge)
Maximum latency to each peer in the cluster
shown as millisecond
etcd.leader.latency.stddev
(gauge)
Standard deviation of latency to each peer in the cluster
shown as millisecond

etcd metrics are tagged with etcd_state:leader or etcd_state:follower, depending on the node status, so you can easily aggregate metrics by status.
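
To illustrate what aggregating by the etcd_state tag means, here is a minimal sketch that sums one metric per state. The data points are made-up sample values, not real measurements, and the helper sum_by_state is hypothetical; in practice this grouping would normally be done in a Datadog query or dashboard rather than in your own code.

```python
from collections import defaultdict

# Made-up sample points: (metric name, value, tags). Not real measurements.
points = [
    ("etcd.store.watchers", 12.0, ["etcd_state:leader"]),
    ("etcd.store.watchers", 4.0, ["etcd_state:follower"]),
    ("etcd.store.watchers", 6.0, ["etcd_state:follower"]),
]


def sum_by_state(points):
    """Sum point values per etcd_state tag value (leader or follower)."""
    totals = defaultdict(float)
    for _name, value, tags in points:
        for tag in tags:
            key, _, state = tag.partition(":")
            if key == "etcd_state":
                totals[state] += value
    return dict(totals)


if __name__ == "__main__":
    print(sum_by_state(points))  # {'leader': 12.0, 'follower': 10.0}
```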

Events

The etcd check does not include any events at this time.

Service Checks

etcd.can_connect:

Returns ‘Critical’ if the Agent cannot collect metrics from your etcd API endpoint.

etcd.healthy:

Returns ‘Critical’ if a member node is not healthy. Returns ‘Unknown’ if the Agent can’t reach the /health endpoint, or if the health status is missing.
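
The etcd.healthy rules above can be sketched as a small decision function. This is an illustration of the described behavior, not the check's actual source. It assumes the /health endpoint returns a JSON object with a health field whose value is the string "true" (or a boolean) when the member is healthy, and it assumes an 'OK' status for the healthy case, which the text above does not state. The name healthy_status is hypothetical.

```python
OK, CRITICAL, UNKNOWN = "OK", "Critical", "Unknown"


def healthy_status(payload):
    """Map a decoded /health response to a service check status.

    `payload` is the decoded JSON object, or None when the endpoint
    could not be reached.
    """
    # Unreachable endpoint or missing health status: report Unknown.
    if payload is None or "health" not in payload:
        return UNKNOWN
    value = payload["health"]
    # Assumption: health may arrive as the string "true" or a boolean.
    is_healthy = value is True or str(value).lower() == "true"
    return OK if is_healthy else CRITICAL


if __name__ == "__main__":
    for sample in (None, {}, {"health": "true"}, {"health": "false"}):
        print(sample, "->", healthy_status(sample))
```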

Troubleshooting

Need help? Contact Datadog Support.

Further Reading

To get a better idea of how (or why) to integrate etcd with Datadog, check out our blog post about it.

