Consul

Agent Check

Supported OS: Linux, Mac OS, Windows

Overview

The Datadog Agent collects many metrics from Consul nodes, including those for:

  • Total Consul peers
  • Service health - for a given service, how many of its nodes are up, passing, warning, critical?
  • Node health - for a given node, how many of its services are up, passing, warning, critical?
  • Network coordinates - inter- and intra-datacenter latencies
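
As a rough illustration of the service health metric, the sketch below counts, for each service, how many distinct nodes report each status. It assumes input shaped like entries from Consul's /v1/health/state/any endpoint (with Node, ServiceName, and Status fields). The sample data and the function name are hypothetical, and this is not the Agent's actual implementation.

```python
from collections import defaultdict

# Hypothetical sample, shaped like entries from /v1/health/state/any.
SAMPLE_CHECKS = [
    {"Node": "node-a", "ServiceName": "web", "Status": "passing"},
    {"Node": "node-b", "ServiceName": "web", "Status": "critical"},
    {"Node": "node-c", "ServiceName": "web", "Status": "passing"},
    {"Node": "node-a", "ServiceName": "db", "Status": "warning"},
    {"Node": "node-a", "ServiceName": "", "Status": "passing"},
]


def count_nodes_by_status(checks):
    """Return {service: {status: number of distinct nodes}}."""
    nodes = defaultdict(lambda: defaultdict(set))
    for check in checks:
        service = check.get("ServiceName")
        if not service:
            continue  # node-level checks carry no service name
        nodes[service][check["Status"]].add(check["Node"])
    return {
        service: {status: len(members) for status, members in by_status.items()}
        for service, by_status in nodes.items()
    }


if __name__ == "__main__":
    print(count_nodes_by_status(SAMPLE_CHECKS))
```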

The Consul Agent can provide further metrics via DogStatsD. These metrics describe the internal health of Consul itself rather than the services that depend on Consul. There are metrics for:

  • Serf events and member flaps
  • The Raft protocol
  • DNS performance

And many more.

In addition to metrics, the Datadog Agent sends a service check for each of Consul’s health checks and an event after each new leader election.

Setup

Installation

The Datadog Agent’s Consul check is included in the Datadog Agent package, so you don’t need to install anything else on your Consul nodes.

Configuration

Edit the consul.d/conf.yaml file, in the conf.d/ folder at the root of your Agent’s configuration directory, to start collecting your Consul metrics and logs. See the sample consul.d/conf.yaml for all available configuration options.

Metric Collection

  1. Add this configuration block to your consul.d/conf.yaml file to start gathering your Consul metrics:

    init_config:
    
    instances:
        # URL of the Consul HTTP server
        # use 'https' if Consul is configured for SSL
        - url: http://localhost:8500
          # client certificate to use if Consul is configured for SSL
          # client_cert_file: '/path/to/client.concatenated.pem'

          # submit per-service node status and per-node service status
          catalog_checks: true

          # emit leader election events
          self_leader_check: true

          # collect inter- and intra-datacenter network latency metrics
          network_latency_checks: true

    See the sample consul.d/conf.yaml for all available configuration options.

  2. Restart the Agent to start sending Consul metrics to Datadog.

Connect Consul Agent to DogStatsD

In the main Consul configuration file, add your dogstatsd_addr nested under the top-level telemetry key:

{
  ...
  "telemetry": {
    "dogstatsd_addr": "127.0.0.1:8125"
  },
  ...
}

Reload the Consul Agent to start sending more Consul metrics to DogStatsD.
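
To get a feel for what travels over this connection, the following sketch builds a datagram in the DogStatsD text format (name:value|type, optionally followed by |#tags) and sends it over UDP to the same address. The metric name and tag are hypothetical and exist only for this smoke test. Because UDP is fire-and-forget, a successful send does not prove that anything is listening.

```python
import socket


def format_datagram(name, value, metric_type, tags=()):
    """Build a DogStatsD datagram: name:value|type[|#tag1,tag2]."""
    datagram = f"{name}:{value}|{metric_type}"
    if tags:
        datagram += "|#" + ",".join(tags)
    return datagram


def send_datagram(datagram, addr=("127.0.0.1", 8125)):
    """Send one datagram over UDP; there is no delivery confirmation."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram.encode("utf-8"), addr)


if __name__ == "__main__":
    # "example.consul.test" is a hypothetical counter, not a real Consul metric.
    send_datagram(format_datagram("example.consul.test", 1, "c", ["env:dev"]))
```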

Log Collection

Available for Agent versions >6.0

  1. Collecting logs is disabled by default in the Datadog Agent. Enable it in datadog.yaml with:

    logs_enabled: true

  2. Add this configuration block to your consul.d/conf.yaml file to start collecting your Consul logs:

      logs:
          - type: file
            path: /var/log/consul_server.log
            source: consul
            service: myservice

    Change the path and service parameter values to match your environment. See the sample consul.d/conf.yaml for all available configuration options.

  3. Restart the Agent.

Learn more about log collection in the log documentation.

Validation

Run the Agent’s status subcommand and look for consul under the Checks section.

Note: If your Consul nodes have debug logging enabled, you’ll see the Datadog Agent’s regular polling in the Consul log:

    2017/03/27 21:38:12 [DEBUG] http: Request GET /v1/status/leader (59.344us) from=127.0.0.1:53768
    2017/03/27 21:38:12 [DEBUG] http: Request GET /v1/status/peers (62.678us) from=127.0.0.1:53770
    2017/03/27 21:38:12 [DEBUG] http: Request GET /v1/health/state/any (106.725us) from=127.0.0.1:53772
    2017/03/27 21:38:12 [DEBUG] http: Request GET /v1/catalog/services (79.657us) from=127.0.0.1:53774
    2017/03/27 21:38:12 [DEBUG] http: Request GET /v1/health/service/consul (153.917us) from=127.0.0.1:53776
    2017/03/27 21:38:12 [DEBUG] http: Request GET /v1/coordinate/datacenters (71.778us) from=127.0.0.1:53778
    2017/03/27 21:38:12 [DEBUG] http: Request GET /v1/coordinate/nodes (84.95us) from=127.0.0.1:53780
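
If you want to confirm programmatically which endpoints the Agent polls, a small parser over these log lines is enough. The sketch below assumes the debug line format shown above. The regular expression and function name are illustrative and may need adjusting for other Consul versions.

```python
import re

# Matches lines such as:
# 2017/03/27 21:38:12 [DEBUG] http: Request GET /v1/status/leader (59.344us) from=127.0.0.1:53768
REQUEST_RE = re.compile(
    r"\[DEBUG\] http: Request (?P<method>\S+) (?P<path>\S+) "
    r"\((?P<duration>[^)]+)\) from=(?P<client>\S+)"
)


def parse_request(line):
    """Return method, path, duration, and client for a request line, else None."""
    match = REQUEST_RE.search(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    sample = (
        "2017/03/27 21:38:12 [DEBUG] http: Request GET /v1/status/leader "
        "(59.344us) from=127.0.0.1:53768"
    )
    print(parse_request(sample))
```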

Consul Agent to DogStatsD

Use netstat to verify that Consul is sending its metrics, too:

$ sudo netstat -nup | grep "127.0.0.1:8125.*ESTABLISHED"
udp        0      0 127.0.0.1:53874         127.0.0.1:8125          ESTABLISHED 23176/consul

Data Collected

Metrics

consul.catalog.nodes_critical
(gauge)
Number of nodes with service status `critical` from those registered
shown as node
consul.catalog.nodes_passing
(gauge)
Number of nodes with service status `passing` from those registered
shown as node
consul.catalog.nodes_up
(gauge)
Number of nodes
shown as node
consul.catalog.nodes_warning
(gauge)
Number of nodes with service status `warning` from those registered
shown as node
consul.catalog.total_nodes
(gauge)
Number of nodes registered in the consul cluster
shown as node
consul.catalog.services_critical
(gauge)
Total critical services on nodes
shown as service
consul.catalog.services_passing
(gauge)
Total passing services on nodes
shown as service
consul.catalog.services_up
(gauge)
Total services registered on nodes
shown as service
consul.catalog.services_warning
(gauge)
Total warning services on nodes
shown as service
consul.net.node.latency.min
(gauge)
minimum latency from this node to all others
shown as millisecond
consul.net.node.latency.p25
(gauge)
p25 latency from this node to all others
shown as millisecond
consul.net.node.latency.median
(gauge)
median latency from this node to all others
shown as millisecond
consul.net.node.latency.p75
(gauge)
p75 latency from this node to all others
shown as millisecond
consul.net.node.latency.p90
(gauge)
p90 latency from this node to all others
shown as millisecond
consul.net.node.latency.p95
(gauge)
p95 latency from this node to all others
shown as millisecond
consul.net.node.latency.p99
(gauge)
p99 latency from this node to all others
shown as millisecond
consul.net.node.latency.max
(gauge)
maximum latency from this node to all others
shown as millisecond
consul.peers
(gauge)
Number of peers in the peer set
consul.runtime.num_goroutines
(gauge)
Number of running goroutines
consul.runtime.alloc_bytes
(gauge)
Current bytes allocated by the Consul process
shown as byte
consul.runtime.heap_objects
(gauge)
Number of objects allocated on the heap
shown as object
consul.runtime.sys_bytes
(gauge)
Total size of the virtual address space reserved by the Go runtime
shown as byte
consul.runtime.malloc_count
(gauge)
Cumulative count of heap objects allocated
shown as object
consul.runtime.free_count
(gauge)
Cumulative count of heap objects freed
shown as object
consul.runtime.total_gc_pause_ns
(gauge)
Cumulative nanoseconds in GC stop-the-world pauses since Consul started
shown as nanosecond
consul.runtime.total_gc_runs
(gauge)
Number of completed GC cycles
consul.raft.state.leader
(rate)
Number of completed leader elections
shown as event
consul.raft.state.candidate
(rate)
Number of initiated leader elections
shown as event
consul.raft.apply
(rate)
Number of raft transactions occurring
shown as transaction
consul.raft.commitTime.avg
(gauge)
The average time it takes to commit a new entry to the raft log on the leader
shown as millisecond
consul.raft.commitTime.count
(rate)
The number of samples of raft.commitTime
consul.raft.commitTime.max
(gauge)
The max time it takes to commit a new entry to the raft log on the leader
shown as millisecond
consul.raft.commitTime.median
(gauge)
The median time it takes to commit a new entry to the raft log on the leader
shown as millisecond
consul.raft.commitTime.95percentile
(gauge)
The p95 time it takes to commit a new entry to the raft log on the leader
shown as millisecond
consul.raft.leader.dispatchLog.avg
(gauge)
The average time it takes for the leader to write log entries to disk
shown as millisecond
consul.raft.leader.dispatchLog.count
(rate)
The number of samples of raft.leader.dispatchLog
consul.raft.leader.dispatchLog.max
(gauge)
The max time it takes for the leader to write log entries to disk
shown as millisecond
consul.raft.leader.dispatchLog.median
(gauge)
The median time it takes for the leader to write log entries to disk
shown as millisecond
consul.raft.leader.dispatchLog.95percentile
(gauge)
The p95 time it takes for the leader to write log entries to disk
shown as millisecond
consul.raft.leader.lastContact.avg
(gauge)
Average time elapsed since the leader was last able to check its lease with followers
shown as millisecond
consul.raft.leader.lastContact.count
(rate)
The number of samples of raft.leader.lastContact
consul.raft.leader.lastContact.max
(gauge)
Max time elapsed since the leader was last able to check its lease with followers
shown as millisecond
consul.raft.leader.lastContact.median
(gauge)
Median time elapsed since the leader was last able to check its lease with followers
shown as millisecond
consul.raft.leader.lastContact.95percentile
(gauge)
P95 time elapsed since the leader was last able to check its lease with followers
shown as millisecond
consul.memberlist.msg.suspect
(rate)
Number of times an agent suspects another as failed while probing during gossip protocol
consul.serf.member.flap
(rate)
Number of times an agent is marked dead and then quickly recovers
consul.serf.events
(rate)
Incremented when an agent processes a serf event
shown as event
consul.serf.member.join
(rate)
Incremented when an agent processes a join event
shown as event

See Consul’s Telemetry doc for a description of metrics the Consul Agent sends to DogStatsD.

See Consul’s Network Coordinates doc if you’re curious about how the network latency metrics are calculated.
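
For orientation, Consul's Network Coordinates doc estimates the round-trip time between two nodes as the Euclidean distance between their coordinate vectors plus both heights, applying the adjustment terms only when the adjusted result stays positive. The sketch below is a Python port of that calculation. It assumes each coordinate is a dict with Vec, Height, and Adjustment keys, as found in the Coord objects returned by Consul's coordinate endpoints. The sample values are made up, and how the Agent aggregates these estimates into percentiles is not shown.

```python
import math


def estimate_rtt_seconds(a, b):
    """Estimate the round-trip time in seconds between two coordinates."""
    if len(a["Vec"]) != len(b["Vec"]):
        raise ValueError("coordinate dimensions differ")
    # Euclidean distance between the vectors, plus both heights.
    euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(a["Vec"], b["Vec"])))
    rtt = euclidean + a["Height"] + b["Height"]
    # Apply the adjustment terms, guarding against a non-positive result.
    adjusted = rtt + a["Adjustment"] + b["Adjustment"]
    return adjusted if adjusted > 0.0 else rtt


if __name__ == "__main__":
    # Hypothetical coordinates for two nodes.
    node_a = {"Vec": [0.0, 0.0], "Height": 0.001, "Adjustment": 0.0}
    node_b = {"Vec": [0.003, 0.004], "Height": 0.001, "Adjustment": 0.0}
    print(estimate_rtt_seconds(node_a, node_b) * 1000.0, "ms")
```

The function returns seconds. Multiply by 1000 to compare against the consul.net.node.latency.* metrics, which are shown in milliseconds.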

Events

consul.new_leader:

The Datadog Agent emits an event when the Consul cluster elects a new leader, tagging it with prev_consul_leader, curr_consul_leader, and consul_datacenter.
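
The leader-change logic can be pictured roughly as follows: compare the leader seen on the previous run with the current one, and build an event only when they differ. This is a minimal sketch, not the Agent's code. The function name, the dict layout, and the key:value tag form are assumptions, and only the three tag names come from the description above.

```python
def leader_change_event(previous_leader, current_leader, datacenter):
    """Return an event dict if the leader changed, otherwise None."""
    if not previous_leader or previous_leader == current_leader:
        return None  # first observation, or no change since the last run
    return {
        "event_type": "consul.new_leader",
        "tags": [
            f"prev_consul_leader:{previous_leader}",
            f"curr_consul_leader:{current_leader}",
            f"consul_datacenter:{datacenter}",
        ],
    }


if __name__ == "__main__":
    # Hypothetical leader addresses and datacenter name.
    print(leader_change_event("10.0.0.1:8300", "10.0.0.2:8300", "dc1"))
```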

Service Checks

consul.check:

The Datadog Agent submits a service check for each of Consul’s health checks, tagging each with:

  • service:<name>, if Consul reports a ServiceName
  • consul_service_id:<id>, if Consul reports a ServiceID
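
The tagging rules above can be sketched as a small function over one Consul health check entry. The status mapping (passing, warning, and critical to the Datadog service check codes 0, 1, and 2, with anything else treated as unknown) is an assumption made for illustration, as are the function name and the shape of the returned dict.

```python
# Datadog service check status codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Assumed mapping from Consul health states to service check statuses.
STATUS_MAP = {"passing": OK, "warning": WARNING, "critical": CRITICAL}


def service_check_for(check):
    """Build a consul.check service check from one Consul health check entry."""
    tags = []
    if check.get("ServiceName"):
        tags.append("service:" + check["ServiceName"])
    if check.get("ServiceID"):
        tags.append("consul_service_id:" + check["ServiceID"])
    return {
        "name": "consul.check",
        "status": STATUS_MAP.get(check.get("Status"), UNKNOWN),
        "tags": tags,
    }


if __name__ == "__main__":
    # Hypothetical health check entry.
    print(service_check_for(
        {"ServiceName": "web", "ServiceID": "web-1", "Status": "warning"}
    ))
```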

Troubleshooting

Need help? Contact Datadog Support.
