Consul

Agent Check

Supported OS: Linux, Mac OS, Windows

Overview

The Datadog Agent collects many metrics from Consul nodes, including those for:

  • Total Consul peers
  • Service health - for a given service, how many of its nodes are up, passing, warning, critical?
  • Node health - for a given node, how many of its services are up, passing, warning, critical?
  • Network coordinates - inter- and intra-datacenter latencies

The Consul Agent can provide further metrics via DogStatsD. These metrics relate to the internal health of Consul itself rather than to the services that depend on Consul. There are metrics for:

  • Serf events and member flaps
  • The Raft protocol
  • DNS performance

And many more.

Finally, in addition to metrics, the Datadog Agent also sends a service check for each of Consul’s health checks, and an event after each new leader election.

Setup

Installation

The Datadog Agent’s Consul check is included in the Datadog Agent package, so you don’t need to install anything else on your Consul nodes.

Configuration

Host

To configure this check for an Agent running on a host:

Metric Collection
  1. Edit the consul.d/conf.yaml file, in the conf.d/ folder at the root of your Agent’s configuration directory, to start collecting your Consul metrics. See the sample consul.d/conf.yaml for all available configuration options.

    init_config:
    
    instances:
     ## @param url - string - required
     ## Where your Consul HTTP Server Lives
     ## Point the URL at the leader to get metrics about your Consul Cluster.
     ## Remember to use https instead of http if your Consul setup is configured to do so.
     #
     - url: http://localhost:8500
  2. Restart the Agent.

Reload the Consul Agent to start sending more Consul metrics to DogStatsD.
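
Before restarting, it can help to confirm that the url value in conf.yaml points at a reachable Consul HTTP API. The sketch below is not part of the integration; it assumes the sample value http://localhost:8500 and uses the /v1/status/leader path that also appears in the debug log under Validation. The function names are hypothetical.

```python
import json
import urllib.request


def leader_endpoint(base_url):
    """Build the /v1/status/leader URL from the `url` value in conf.yaml."""
    return base_url.rstrip("/") + "/v1/status/leader"


def fetch_leader(base_url, timeout=5.0):
    """Query Consul and return the decoded JSON body of /v1/status/leader."""
    with urllib.request.urlopen(leader_endpoint(base_url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Only builds the URL here; call fetch_leader() on a host that can reach Consul.
print(leader_endpoint("http://localhost:8500"))
```

Calling fetch_leader("http://localhost:8500") on the Consul host should return the leader address if the API is reachable; a connection error suggests the url value needs adjusting.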

Log collection

Available for Agent versions >6.0

  1. Collecting logs is disabled by default in the Datadog Agent. Enable it in datadog.yaml with:

    logs_enabled: true
  2. Add this configuration block to your consul.d/conf.yaml file to start collecting your Consul logs:

    logs:
     - type: file
       path: /var/log/consul_server.log
       source: consul
       service: myservice

Change the path and service parameter values to match your environment. See the sample consul.d/conf.yaml for all available configuration options.

  3. Restart the Agent.

Containerized

For containerized environments, see the Autodiscovery Integration Templates for guidance on applying the parameters below.

Metric collection

Parameter            Value
<INTEGRATION_NAME>   consul
<INIT_CONFIG>        blank or {}
<INSTANCE_CONFIG>    {"url": "https://%%host%%:8500"}

Log collection

Available for Agent versions >6.0

Collecting logs is disabled by default in the Datadog Agent. To enable it, see Kubernetes log collection documentation.

Parameter      Value
<LOG_CONFIG>   {"source": "consul", "service": "<SERVICE_NAME>"}
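
As a rough illustration of how the parameters above might be applied, the sketch below builds Kubernetes pod annotations from them. It assumes the ad.datadoghq.com/<container>.check_names, .init_configs, .instances, and .logs annotation keys; treat those key names, the function name, and the container and service values as assumptions, and defer to the Autodiscovery Integration Templates documentation for the authoritative format.

```python
import json


def consul_annotations(container, service):
    """Build Autodiscovery-style pod annotations from the parameter tables above."""
    prefix = f"ad.datadoghq.com/{container}"
    return {
        # <INTEGRATION_NAME>
        f"{prefix}.check_names": json.dumps(["consul"]),
        # <INIT_CONFIG>
        f"{prefix}.init_configs": json.dumps([{}]),
        # <INSTANCE_CONFIG>
        f"{prefix}.instances": json.dumps([{"url": "https://%%host%%:8500"}]),
        # <LOG_CONFIG>
        f"{prefix}.logs": json.dumps([{"source": "consul", "service": service}]),
    }


# "consul" and "consul-server" are hypothetical container and service names.
for key, value in consul_annotations("consul", "consul-server").items():
    print(f"{key}: {value}")
```

Each annotation value is a JSON-encoded list, so several checks or log sources could share one container.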

DogStatsD

Optionally, you can configure Consul to also send data to the Agent through DogStatsD instead of relying on the Agent to pull the data from Consul.

  1. Configure Consul to send DogStatsD metrics by adding dogstatsd_addr under the top-level telemetry key in the main Consul configuration file:

    {
      ...
      "telemetry": {
        "dogstatsd_addr": "127.0.0.1:8125"
      },
      ...
    }
  2. Update the Datadog Agent main configuration file datadog.yaml by adding the following configs to ensure metrics are tagged correctly:

    # dogstatsd_mapper_cache_size: 1000  # default to 1000
    dogstatsd_mapper_profiles:
     - name: consul
       prefix: "consul."
       mappings:
         - match: 'consul\.http\.([a-zA-Z]+)\.(.*)'
           match_type: "regex"
           name: "consul.http.request"
           tags:
             http_method: "$1"
             path: "$2"
         - match: 'consul\.raft\.replication\.appendEntries\.logs\.([0-9a-f-]+)'
           match_type: "regex"
           name: "consul.raft.replication.appendEntries.logs"
           tags:
             consul_node_id: "$1"
         - match: 'consul\.raft\.replication\.appendEntries\.rpc\.([0-9a-f-]+)'
           match_type: "regex"
           name: "consul.raft.replication.appendEntries.rpc"
           tags:
             consul_node_id: "$1"
         - match: 'consul\.raft\.replication\.heartbeat\.([0-9a-f-]+)'
           match_type: "regex"
           name: "consul.raft.replication.heartbeat"
           tags:
             consul_node_id: "$1"
  3. Restart the Agent.
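
To see what the mapper profile above is meant to do, the following sketch applies the same four regular expressions to incoming metric names. It is only an approximation for reasoning about the patterns: it assumes each pattern must match the whole metric name and that "$1" and "$2" refer to capture groups, and it is not the Agent's implementation. The node ID in the usage example is made up.

```python
import re

# (pattern, mapped name, tag templates) copied from the mapper profile above.
MAPPINGS = [
    (r"consul\.http\.([a-zA-Z]+)\.(.*)", "consul.http.request",
     {"http_method": "$1", "path": "$2"}),
    (r"consul\.raft\.replication\.appendEntries\.logs\.([0-9a-f-]+)",
     "consul.raft.replication.appendEntries.logs", {"consul_node_id": "$1"}),
    (r"consul\.raft\.replication\.appendEntries\.rpc\.([0-9a-f-]+)",
     "consul.raft.replication.appendEntries.rpc", {"consul_node_id": "$1"}),
    (r"consul\.raft\.replication\.heartbeat\.([0-9a-f-]+)",
     "consul.raft.replication.heartbeat", {"consul_node_id": "$1"}),
]


def map_metric(name):
    """Return (mapped_name, tags); names that match no pattern pass through."""
    for pattern, mapped, templates in MAPPINGS:
        match = re.fullmatch(pattern, name)
        if match:
            # "$1" is the first capture group, "$2" the second.
            tags = {key: match.group(int(tpl[1:])) for key, tpl in templates.items()}
            return mapped, tags
    return name, {}


print(map_metric("consul.http.GET.v1.kv._"))
# Hypothetical node ID, for illustration only.
print(map_metric("consul.raft.replication.heartbeat.0a1b2c3d-0000-4000-8000-000000000001"))
```

The effect is that the HTTP verb, the path, and the node ID become tag values on a small set of metric names, rather than each combination producing its own metric name.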

OpenMetrics

Instead of using DogStatsD, you can enable the use_prometheus_endpoint configuration option to get the same metrics from the Prometheus endpoint.

Note: Use either the DogStatsD method or the Prometheus method. Do not enable both for the same instance.

  1. Configure Consul to expose metrics to the Prometheus endpoint. Set the prometheus_retention_time nested under the top-level telemetry key of the main Consul configuration file:

    {
      ...
      "telemetry": {
        "prometheus_retention_time": "360h"
      },
      ...
    }
  2. Edit the consul.d/conf.yaml file, in the conf.d/ folder at the root of your Agent’s configuration directory, to start using the Prometheus endpoint.

    instances:
        - url: <EXAMPLE>
          use_prometheus_endpoint: true
  3. Restart the Agent.
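
If you want to inspect what Consul exposes before enabling the option, you can request /v1/agent/metrics?format=prometheus from the Consul HTTP API and read the text output. The sketch below parses that text format in a deliberately minimal way: it assumes one sample per line with no trailing timestamp. The sample input is illustrative, not captured from a real Consul agent, and the function name is hypothetical.

```python
def parse_prometheus_text(text):
    """Parse 'name value' sample lines, skipping comments and blank lines."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last whitespace-separated field (no timestamps assumed).
        key, value = line.rsplit(None, 1)
        samples[key] = float(value)
    return samples


# Illustrative input only.
SAMPLE = """\
# TYPE consul_runtime_num_goroutines gauge
consul_runtime_num_goroutines 85
# TYPE consul_runtime_alloc_bytes gauge
consul_runtime_alloc_bytes 1.5e+07
"""

print(parse_prometheus_text(SAMPLE))
```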

Validation

Run the Agent’s status subcommand and look for consul under the Checks section.

Note: If your Consul nodes have debug logging enabled, you’ll see the Datadog Agent’s regular polling in the Consul log:

2017/03/27 21:38:12 [DEBUG] http: Request GET /v1/status/leader (59.344us) from=127.0.0.1:53768
2017/03/27 21:38:12 [DEBUG] http: Request GET /v1/status/peers (62.678us) from=127.0.0.1:53770
2017/03/27 21:38:12 [DEBUG] http: Request GET /v1/health/state/any (106.725us) from=127.0.0.1:53772
2017/03/27 21:38:12 [DEBUG] http: Request GET /v1/catalog/services (79.657us) from=127.0.0.1:53774
2017/03/27 21:38:12 [DEBUG] http: Request GET /v1/health/service/consul (153.917us) from=127.0.0.1:53776
2017/03/27 21:38:12 [DEBUG] http: Request GET /v1/coordinate/datacenters (71.778us) from=127.0.0.1:53778
2017/03/27 21:38:12 [DEBUG] http: Request GET /v1/coordinate/nodes (84.95us) from=127.0.0.1:53780
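
If the Consul log is busy, a small script can pull out just the request lines. The sketch below assumes the log line shape shown above and extracts the requested paths; the regular expression and function name are illustrative.

```python
import re

# Matches "[DEBUG] http: Request <METHOD> <PATH> (<DURATION>) from=<ADDR>".
REQUEST_LINE = re.compile(
    r"\[DEBUG\] http: Request (?P<method>[A-Z]+) (?P<path>\S+) "
    r"\((?P<duration>[^)]+)\) from=(?P<source>\S+)"
)


def polled_paths(log_text):
    """Return the HTTP paths requested, in the order they appear in the log."""
    return [m.group("path") for m in REQUEST_LINE.finditer(log_text)]


# Two lines copied from the log excerpt above.
LOG = """\
2017/03/27 21:38:12 [DEBUG] http: Request GET /v1/status/leader (59.344us) from=127.0.0.1:53768
2017/03/27 21:38:12 [DEBUG] http: Request GET /v1/status/peers (62.678us) from=127.0.0.1:53770
"""

print(polled_paths(LOG))
```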

Consul Agent to DogStatsD

Use netstat to verify that Consul is sending its metrics, too:

$ sudo netstat -nup | grep "127.0.0.1:8125.*ESTABLISHED"
udp        0      0 127.0.0.1:53874         127.0.0.1:8125          ESTABLISHED 23176/consul
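
If no Consul traffic shows up, one way to check that something is listening on the DogStatsD port is to send a test datagram yourself. This sketch assumes the DogStatsD text format name:value|g|#tags and the 127.0.0.1:8125 address used in the telemetry configuration. The metric name is hypothetical, and because UDP gives no delivery confirmation, you still need to look for the metric in Datadog.

```python
import socket


def format_gauge(name, value, tags=None):
    """Format a gauge in the DogStatsD text format: name:value|g|#tag1,tag2."""
    datagram = f"{name}:{value}|g"
    if tags:
        datagram += "|#" + ",".join(tags)
    return datagram


def send_datagram(datagram, host="127.0.0.1", port=8125):
    """Send one datagram over UDP; there is no acknowledgement from the receiver."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram.encode("utf-8"), (host, port))


# Hypothetical metric name and tag, for illustration only.
print(format_gauge("consul.example.test", 1, ["env:dev"]))
```

Passing the formatted string to send_datagram() on the Agent host emits the test metric.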

Data Collected

Metrics

consul.catalog.nodes_critical
(gauge)
The number of nodes with service status `critical` from those registered
Shown as node
consul.catalog.nodes_passing
(gauge)
The number of nodes with service status `passing` from those registered
Shown as node
consul.catalog.nodes_up
(gauge)
The number of nodes
Shown as node
consul.catalog.nodes_warning
(gauge)
The number of nodes with service status `warning` from those registered
Shown as node
consul.catalog.total_nodes
(gauge)
The number of nodes registered in the consul cluster
Shown as node
consul.catalog.services_critical
(gauge)
Total critical services on nodes
Shown as service
consul.catalog.services_passing
(gauge)
Total passing services on nodes
Shown as service
consul.catalog.services_up
(gauge)
Total services registered on nodes
Shown as service
consul.catalog.services_warning
(gauge)
Total warning services on nodes
Shown as service
consul.catalog.services_count
(gauge)
Metrics to count the number of services matching criteria like the service tag, the node name, or the status. To be queried using the `sum by` aggregator.
Shown as service
consul.net.node.latency.min
(gauge)
minimum latency from this node to all others
Shown as millisecond
consul.net.node.latency.p25
(gauge)
p25 latency from this node to all others
Shown as millisecond
consul.net.node.latency.median
(gauge)
median latency from this node to all others
Shown as millisecond
consul.net.node.latency.p75
(gauge)
p75 latency from this node to all others
Shown as millisecond
consul.net.node.latency.p90
(gauge)
p90 latency from this node to all others
Shown as millisecond
consul.net.node.latency.p95
(gauge)
p95 latency from this node to all others
Shown as millisecond
consul.net.node.latency.p99
(gauge)
p99 latency from this node to all others
Shown as millisecond
consul.net.node.latency.max
(gauge)
maximum latency from this node to all others
Shown as millisecond
consul.peers
(gauge)
The number of peers in the peer set
consul.memberlist.degraded.probe
(gauge)
[DogStatsD only] This metric counts the number of times the Consul agent has performed failure detection on another agent at a slower probe rate. The agent uses its own health metric as an indicator to perform this action. (A low health score means that the node is healthy, and vice versa.)
consul.memberlist.gossip.95percentile
(gauge)
[DogStatsD only] The p95 for the number of gossips (messages) broadcasted to a set of randomly selected nodes.
Shown as message
consul.memberlist.gossip.avg
(gauge)
[DogStatsD only] The avg for the number of gossips (messages) broadcasted to a set of randomly selected nodes.
Shown as message
consul.memberlist.gossip.count
(rate)
[DogStatsD only] The number of samples of consul.memberlist.gossip
consul.memberlist.gossip.max
(gauge)
[DogStatsD only] The max for the number of gossips (messages) broadcasted to a set of randomly selected nodes.
Shown as message
consul.memberlist.gossip.median
(gauge)
[DogStatsD only] The median for the number of gossips (messages) broadcasted to a set of randomly selected nodes.
Shown as message
consul.memberlist.health.score
(gauge)
[DogStatsD only] This metric describes a node's perception of its own health based on how well it is meeting the soft real-time requirements of the protocol. This metric ranges from 0 to 8, where 0 indicates "totally healthy". For more details see section IV of the Lifeguard paper: https://arxiv.org/pdf/1707.00788.pdf
consul.memberlist.msg.alive
(gauge)
[DogStatsD only] This metric counts the number of alive Consul agents that the agent has mapped out so far, based on the message information given by the network layer.
consul.memberlist.msg.dead
(gauge)
[DogStatsD only] This metric counts the number of times a Consul agent has marked another agent to be a dead node.
Shown as message
consul.memberlist.msg.suspect
(rate)
[DogStatsD only] The number of times a Consul agent suspects another as failed while probing during gossip protocol
consul.memberlist.msg_alive
(gauge)
[DogStatsD only] This metric counts the number of alive Consul agents that the agent has mapped out so far, based on the message information given by the network layer.
Shown as node
consul.memberlist.msg_dead
(gauge)
[DogStatsD only] This metric gives the number of dead Consul agents that the agent has mapped out so far, based on the message information given by the network layer.
Shown as node
consul.memberlist.probenode.95percentile
(gauge)
[DogStatsD only] The p95 for the time taken to perform a single round of failure detection on a select Consul agent.
Shown as node
consul.memberlist.probenode.avg
(gauge)
[DogStatsD only] The avg for the time taken to perform a single round of failure detection on a select Consul agent.
Shown as node
consul.memberlist.probenode.count
(rate)
[DogStatsD only] The number of samples of consul.memberlist.probenode
consul.memberlist.probenode.max
(gauge)
[DogStatsD only] The max for the time taken to perform a single round of failure detection on a select Consul agent.
Shown as node
consul.memberlist.probenode.median
(gauge)
[DogStatsD only] The median for the time taken to perform a single round of failure detection on a select Consul agent.
Shown as node
consul.memberlist.pushpullnode.95percentile
(gauge)
[DogStatsD only] The p95 for the number of Consul agents that have exchanged state with this agent.
Shown as node
consul.memberlist.pushpullnode.avg
(gauge)
[DogStatsD only] The avg for the number of Consul agents that have exchanged state with this agent.
Shown as node
consul.memberlist.pushpullnode.count
(rate)
[DogStatsD only] The number of samples of consul.memberlist.pushpullnode
consul.memberlist.pushpullnode.max
(gauge)
[DogStatsD only] The max for the number of Consul agents that have exchanged state with this agent.
Shown as node
consul.memberlist.pushpullnode.median
(gauge)
[DogStatsD only] The median for the number of Consul agents that have exchanged state with this agent.
Shown as node
consul.memberlist.tcp.accept
(gauge)
[DogStatsD only] This metric counts the number of times a Consul agent has accepted an incoming TCP stream connection.
Shown as connection
consul.memberlist.tcp.connect
(gauge)
[DogStatsD only] This metric counts the number of times a Consul agent has initiated a push/pull sync with another agent.
Shown as connection
consul.memberlist.tcp.sent
(gauge)
[DogStatsD only] This metric measures the total number of bytes sent by a Consul agent through the TCP protocol
Shown as byte
consul.memberlist.udp.received
(gauge)
[DogStatsD only] This metric measures the total number of bytes received by a Consul agent through the UDP protocol.
Shown as byte
consul.memberlist.udp.sent
(gauge)
[DogStatsD only] This metric measures the total number of bytes sent by a Consul agent through the UDP protocol.
Shown as byte
consul.client.rpc
(rate)
[DogStatsD only] This increments whenever a Consul agent in client mode makes an RPC request to a Consul server. This gives a measure of how much a given agent is loading the Consul servers. Currently, this is only generated by agents in client mode, not Consul servers.
Shown as request
consul.client.rpc.failed
(gauge)
[DogStatsD only] Increments whenever a Consul agent in client mode makes an RPC request to a Consul server and fails
Shown as request
consul.hosts_file.age
(gauge)
[DogStatsD only] Age of the hosts file
consul.http.<verb>.<path>
(gauge)
[DogStatsD only] This tracks how long it takes to service the given HTTP request for the given verb and path. Paths do not include details like service or key names; for these, an underscore is present as a placeholder (e.g. consul.http.GET.v1.kv._)
Shown as millisecond
consul.runtime.num_goroutines
(gauge)
[DogStatsD only] The number of running goroutines
consul.runtime.alloc_bytes
(gauge)
[DogStatsD only] Current bytes allocated by the Consul process
Shown as byte
consul.runtime.heap_objects
(gauge)
[DogStatsD only] The number of objects allocated on the heap
Shown as object
consul.runtime.sys_bytes
(gauge)
[DogStatsD only] Total size of the virtual address space reserved by the Go runtime
Shown as byte
consul.runtime.malloc_count
(gauge)
[DogStatsD only] Cumulative count of heap objects allocated
Shown as object
consul.runtime.free_count
(gauge)
[DogStatsD only] Cumulative count of heap objects freed
Shown as object
consul.runtime.total_gc_pause_ns
(gauge)
[DogStatsD only] Cumulative nanoseconds in GC stop-the-world pauses since Consul started
Shown as nanosecond
consul.runtime.total_gc_runs
(gauge)
[DogStatsD only] The number of completed GC cycles
consul.runtime.gc_pause_ns.95percentile
(gauge)
[DogStatsD only] The p95 for the number of nanoseconds consumed by stop-the-world garbage collection (GC) pauses since Consul started.
Shown as nanosecond
consul.runtime.gc_pause_ns.avg
(gauge)
[DogStatsD only] The avg for the number of nanoseconds consumed by stop-the-world garbage collection (GC) pauses since Consul started.
Shown as nanosecond
consul.runtime.gc_pause_ns.count
(rate)
[DogStatsD only] The number of samples of consul.runtime.gc_pause_ns
consul.runtime.gc_pause_ns.max
(gauge)
[DogStatsD only] The max for the number of nanoseconds consumed by stop-the-world garbage collection (GC) pauses since Consul started.
Shown as nanosecond
consul.runtime.gc_pause_ns.median
(gauge)
[DogStatsD only] The median for the number of nanoseconds consumed by stop-the-world garbage collection (GC) pauses since Consul started.
Shown as nanosecond
consul.raft.state.leader
(rate)
[DogStatsD only] The number of completed leader elections
Shown as event
consul.raft.state.candidate
(rate)
[DogStatsD only] The number of initiated leader elections
Shown as event
consul.raft.apply
(rate)
[DogStatsD only] The number of raft transactions occurring
Shown as transaction
consul.raft.commitTime.avg
(gauge)
[DogStatsD only] The average time it takes to commit a new entry to the raft log on the leader
Shown as millisecond
consul.raft.commitTime.count
(rate)
[DogStatsD only] The number of samples of raft.commitTime
consul.raft.commitTime.max
(gauge)
[DogStatsD only] The max time it takes to commit a new entry to the raft log on the leader
Shown as millisecond
consul.raft.commitTime.median
(gauge)
[DogStatsD only] The median time it takes to commit a new entry to the raft log on the leader
Shown as millisecond
consul.raft.commitTime.95percentile
(gauge)
[DogStatsD only] The p95 time it takes to commit a new entry to the raft log on the leader
Shown as millisecond
consul.raft.leader.dispatchLog.avg
(gauge)
[DogStatsD only] The average time it takes for the leader to write log entries to disk
Shown as millisecond
consul.raft.leader.dispatchLog.count
(rate)
[DogStatsD only] The number of samples of raft.leader.dispatchLog
consul.raft.leader.dispatchLog.max
(gauge)
[DogStatsD only] The max time it takes for the leader to write log entries to disk
Shown as millisecond
consul.raft.leader.dispatchLog.median
(gauge)
[DogStatsD only] The median time it takes for the leader to write log entries to disk
Shown as millisecond
consul.raft.leader.dispatchLog.95percentile
(gauge)
[DogStatsD only] The p95 time it takes for the leader to write log entries to disk
Shown as millisecond
consul.raft.leader.lastContact.avg
(gauge)
[DogStatsD only] Average time elapsed since the leader was last able to check its lease with followers
Shown as millisecond
consul.raft.leader.lastContact.count
(rate)
[DogStatsD only] The number of samples of raft.leader.lastContact
consul.raft.leader.lastContact.max
(gauge)
[DogStatsD only] Max time elapsed since the leader was last able to check its lease with followers
Shown as millisecond
consul.raft.leader.lastContact.median
(gauge)
[DogStatsD only] Median time elapsed since the leader was last able to check its lease with followers
Shown as millisecond
consul.raft.leader.lastContact.95percentile
(gauge)
[DogStatsD only] P95 time elapsed since the leader was last able to check its lease with followers
Shown as millisecond
consul.serf.events
(rate)
[DogStatsD only] Incremented when a Consul agent processes a serf event
Shown as event
consul.serf.coordinate.adjustment_ms.95percentile
(gauge)
[DogStatsD only] The p95 in milliseconds for the node coordinate adjustment
Shown as millisecond
consul.serf.coordinate.adjustment_ms.avg
(gauge)
[DogStatsD only] The avg in milliseconds for the node coordinate adjustment
Shown as millisecond
consul.serf.coordinate.adjustment_ms.count
(rate)
[DogStatsD only] The number of samples of consul.serf.coordinate.adjustment_ms
consul.serf.coordinate.adjustment_ms.max
(gauge)
[DogStatsD only] The max in milliseconds for the node coordinate adjustment
Shown as millisecond
consul.serf.coordinate.adjustment_ms.median
(gauge)
[DogStatsD only] The median in milliseconds for the node coordinate adjustment
Shown as millisecond
consul.serf.member.flap
(rate)
[DogStatsD only] The number of times a Consul agent is marked dead and then quickly recovers
consul.serf.member.join
(rate)
[DogStatsD only] Incremented when a Consul agent processes a join event
Shown as event
consul.serf.member.update
(gauge)
[DogStatsD only] This increments when a Consul agent updates.
consul.serf.member.failed
(gauge)
[DogStatsD only] This increments when a Consul agent is marked dead. This can be an indicator of overloaded agents, network problems, or configuration errors where agents cannot connect to each other on the required ports.
consul.serf.member.left
(gauge)
[DogStatsD only] This increments when a Consul agent leaves the cluster.
consul.serf.msgs.received.95percentile
(gauge)
[DogStatsD only] The p95 for the number of serf messages received
Shown as message
consul.serf.msgs.received.avg
(gauge)
[DogStatsD only] The avg for the number of serf messages received
Shown as message
consul.serf.msgs.received.count
(rate)
[DogStatsD only] The count of serf messages received
consul.serf.msgs.received.max
(gauge)
[DogStatsD only] The max for the number of serf messages received
Shown as message
consul.serf.msgs.received.median
(gauge)
[DogStatsD only] The median for the number of serf messages received
Shown as message
consul.serf.msgs.sent.95percentile
(gauge)
[DogStatsD only] The p95 for the number of serf messages sent
Shown as message
consul.serf.msgs.sent.avg
(gauge)
[DogStatsD only] The avg for the number of serf messages sent
Shown as message
consul.serf.msgs.sent.count
(rate)
[DogStatsD only] The count of serf messages sent
consul.serf.msgs.sent.max
(gauge)
[DogStatsD only] The max for the number of serf messages sent
Shown as message
consul.serf.msgs.sent.median
(gauge)
[DogStatsD only] The median for the number of serf messages sent
Shown as message
consul.serf.queue.event.95percentile
(gauge)
[DogStatsD only] The p95 for the size of the serf event queue
consul.serf.queue.event.avg
(gauge)
[DogStatsD only] The avg size of the serf event queue
consul.serf.queue.event.count
(rate)
[DogStatsD only] The number of items in the serf event queue
consul.serf.queue.event.max
(gauge)
[DogStatsD only] The max size of the serf event queue
consul.serf.queue.event.median
(gauge)
[DogStatsD only] The median size of the serf event queue
consul.serf.queue.intent.95percentile
(gauge)
[DogStatsD only] The p95 for the size of the serf intent queue
consul.serf.queue.intent.avg
(gauge)
[DogStatsD only] The avg size of the serf intent queue
consul.serf.queue.intent.count
(rate)
[DogStatsD only] The number of items in the serf intent queue
consul.serf.queue.intent.max
(gauge)
[DogStatsD only] The max size of the serf intent queue
consul.serf.queue.intent.median
(gauge)
[DogStatsD only] The median size of the serf intent queue
consul.serf.queue.query.95percentile
(gauge)
[DogStatsD only] The p95 for the size of the serf query queue
consul.serf.queue.query.avg
(gauge)
[DogStatsD only] The avg size of the serf query queue
consul.serf.queue.query.count
(rate)
[DogStatsD only] The number of items in the serf query queue
consul.serf.queue.query.max
(gauge)
[DogStatsD only] The max size of the serf query queue
consul.serf.queue.query.median
(gauge)
[DogStatsD only] The median size of the serf query queue
consul.serf.snapshot.appendline.95percentile
(gauge)
[DogStatsD only] The p95 of the time taken by the Consul agent to append an entry into the existing log.
Shown as millisecond
consul.serf.snapshot.appendline.avg
(gauge)
[DogStatsD only] The avg of the time taken by the Consul agent to append an entry into the existing log.
Shown as millisecond
consul.serf.snapshot.appendline.count
(rate)
[DogStatsD only] The number of samples of consul.serf.snapshot.appendline
consul.serf.snapshot.appendline.max
(gauge)
[DogStatsD only] The max of the time taken by the Consul agent to append an entry into the existing log.
Shown as millisecond
consul.serf.snapshot.appendline.median
(gauge)
[DogStatsD only] The median of the time taken by the Consul agent to append an entry into the existing log.
Shown as millisecond
consul.serf.snapshot.compact.95percentile
(gauge)
[DogStatsD only] The p95 of the time taken by the Consul agent to compact a log. This operation occurs only when the snapshot becomes large enough to justify the compaction.
Shown as millisecond
consul.serf.snapshot.compact.avg
(gauge)
[DogStatsD only] The avg of the time taken by the Consul agent to compact a log. This operation occurs only when the snapshot becomes large enough to justify the compaction.
Shown as millisecond
consul.serf.snapshot.compact.count
(rate)
[DogStatsD only] The number of samples of consul.serf.snapshot.compact
consul.serf.snapshot.compact.max
(gauge)
[DogStatsD only] The max of the time taken by the Consul agent to compact a log. This operation occurs only when the snapshot becomes large enough to justify the compaction.
Shown as millisecond
consul.serf.snapshot.compact.median
(gauge)
[DogStatsD only] The median of the time taken by the Consul agent to compact a log. This operation occurs only when the snapshot becomes large enough to justify the compaction.
Shown as millisecond

See Consul’s Telemetry doc for a description of metrics the Consul Agent sends to DogStatsD.

See Consul’s Network Coordinates doc for details on how the network latency metrics are calculated.
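
For intuition about the latency metrics, the sketch below estimates the round-trip time between two nodes from their network coordinates. It follows the distance calculation as described in Consul's Network Coordinates documentation (Euclidean distance plus both heights, with adjustments applied when the result stays positive). Whether the Datadog Agent computes its latency values in exactly this way is an assumption, and the coordinates shown are made up, with fewer dimensions than real ones.

```python
import math


def estimated_rtt_seconds(a, b):
    """Estimate the round-trip time between two coordinates, in seconds."""
    # Euclidean distance between the coordinate vectors.
    sumsq = sum((x - y) ** 2 for x, y in zip(a["Vec"], b["Vec"]))
    rtt = math.sqrt(sumsq) + a["Height"] + b["Height"]
    # Apply the adjustment terms only if the adjusted value stays positive.
    adjusted = rtt + a["Adjustment"] + b["Adjustment"]
    return adjusted if adjusted > 0.0 else rtt


# Hypothetical coordinates, for illustration only.
node_a = {"Vec": [0.0, 0.0], "Height": 0.001, "Adjustment": 0.0}
node_b = {"Vec": [0.003, 0.004], "Height": 0.001, "Adjustment": 0.0}

# Convert seconds to milliseconds, the unit the latency metrics are shown in.
print(estimated_rtt_seconds(node_a, node_b) * 1000.0)
```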

Events

consul.new_leader:
The Datadog Agent emits an event when the Consul cluster elects a new leader, tagging it with prev_consul_leader, curr_consul_leader, and consul_datacenter.

Service Checks

consul.check:
The Datadog Agent submits a service check for each of Consul’s health checks, tagging each with:

  • service:<name>, if Consul reports a ServiceName
  • consul_service_id:<id>, if Consul reports a ServiceID
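
The tagging rule above can be sketched as follows. The input is assumed to be one health check entry from Consul with ServiceName and ServiceID fields; the function name and the sample values are hypothetical, and this is not the Agent's own code.

```python
def service_check_tags(check):
    """Build the tags described above from one Consul health check entry."""
    tags = []
    # Each tag is added only when Consul reports a non-empty value.
    if check.get("ServiceName"):
        tags.append(f"service:{check['ServiceName']}")
    if check.get("ServiceID"):
        tags.append(f"consul_service_id:{check['ServiceID']}")
    return tags


# Hypothetical check entries, for illustration only.
print(service_check_tags({"ServiceName": "web", "ServiceID": "web-1"}))
print(service_check_tags({"ServiceName": "", "ServiceID": ""}))
```

A node-level check with no associated service would therefore carry neither tag.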

Troubleshooting

Need help? Contact Datadog support.