Datadog-AWS ElastiCache Integration

ElastiCache Memcached default dashboard

Overview

To learn how to monitor ElastiCache performance metrics with either Redis or Memcached, see our series of posts. It covers the key performance metrics, how to collect them, and how Coursera monitors ElastiCache using Datadog.

Setup

Installation

To collect all available ElastiCache metrics, complete the following two steps:

  1. Turn on the ElastiCache integration to pull metrics from ElastiCache into Datadog
  2. Set up the Datadog Agent as described in this article (optional, but recommended)

Collecting native metrics with the Agent

The following diagram shows how Datadog collects metrics directly from CloudWatch via the native ElastiCache integration, and how it can additionally collect native metrics directly from the backend technology: Redis or Memcached. By collecting from the backend directly, you will have access to a greater number of important metrics, and at a higher resolution.

ElastiCache, Redis and Memcached integrations

How this works

Because the Agent's metrics are tied to the EC2 instance where the Agent runs, not to the ElastiCache instance itself, use the cacheclusterid tag to connect the two sets of metrics. Once the Agent is configured with the same tags as the ElastiCache instance, combining Redis/Memcached metrics with ElastiCache metrics is straightforward.
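To make the tag-based join concrete, here is a minimal sketch of grouping metric points from both sources by their cacheclusterid tag. The sample points, the host tag value, and the helper names (tag_value, group_by_cluster) are hypothetical and only illustrate the idea; Datadog performs this correlation for you when you scope a graph or monitor by the tag.

```python
from collections import defaultdict

# Hypothetical sample points; tags are "key:value" strings.
points = [
    {"metric": "aws.elasticache.cache_hits",
     "tags": ["cacheclusterid:replica-001"], "value": 120},
    {"metric": "redis.info.latency_ms",
     "tags": ["cacheclusterid:replica-001", "host:i-0abc"], "value": 0.4},
    {"metric": "redis.info.latency_ms",
     "tags": ["cacheclusterid:replica-002"], "value": 0.9},
]


def tag_value(tags, key):
    """Return the value of the first tag named `key`, or None if absent."""
    prefix = key + ":"
    for tag in tags:
        if tag.startswith(prefix):
            return tag[len(prefix):]
    return None


def group_by_cluster(points):
    """Group metric values by their cacheclusterid tag."""
    grouped = defaultdict(dict)
    for point in points:
        cluster = tag_value(point["tags"], "cacheclusterid")
        if cluster is not None:
            grouped[cluster][point["metric"]] = point["value"]
    return dict(grouped)


grouped = group_by_cluster(points)
print(grouped["replica-001"])
```

Both the CloudWatch metric and the Agent metric end up under the same cluster key, even though the Agent metric also carries the tag of the EC2 host it was collected from.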

Step-by-step

Since the Agent is not running on an actual ElastiCache instance, but on a remote machine, the key to setting up this integration correctly is telling the Agent where to collect the metrics from.

Gather connection details for your ElastiCache instance

First navigate to the AWS Console, open the ElastiCache section and then the Cache Clusters tab to find the cluster you want to monitor. It should look like:

ElastiCache Clusters in AWS console

Then click on the “node” link to access its endpoint URL:

Node link in AWS console

Write down the endpoint URL (e.g. replica-001.xxxx.use1.cache.amazonaws.com) and the cacheclusterid (e.g. replica-001). You will need these values to configure the Agent and to create graphs and dashboards.

Configure the Agent

The Redis/Memcached integrations support tagging individual cache instances. These tags were originally designed for monitoring multiple instances on the same machine, but you can use them to your advantage here. The following example configures ElastiCache with Redis using redisdb.yaml, usually found in /etc/dd-agent/conf.d:

init_config:

instances:
  - host: replica-001.xxxx.use1.cache.amazonaws.com # Endpoint URL from AWS console
    port: 6379
    tags:
      - cacheclusterid:replica-001 # Cache Cluster ID from AWS console

Then restart the Agent: sudo /etc/init.d/datadog-agent restart (on Linux).
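If you manage several cache nodes, the same file can be generated from the two values gathered in the AWS console. The following is a minimal sketch, not an official tool: the function name redisdb_config is hypothetical, and it only renders the text of the configuration; writing it to /etc/dd-agent/conf.d and restarting the Agent remain manual steps.

```python
def redisdb_config(endpoint, cache_cluster_id, port=6379):
    """Render a redisdb.yaml body for one ElastiCache Redis node.

    `endpoint` is the node endpoint URL and `cache_cluster_id` is the
    Cache Cluster ID, both copied from the AWS console.
    """
    return (
        "init_config:\n"
        "\n"
        "instances:\n"
        f"  - host: {endpoint}\n"
        f"    port: {port}\n"
        "    tags:\n"
        f"      - cacheclusterid:{cache_cluster_id}\n"
    )


config = redisdb_config("replica-001.xxxx.use1.cache.amazonaws.com", "replica-001")
print(config)
```

Keeping the tag value identical to the Cache Cluster ID is what lets the Agent's metrics line up with the ElastiCache metrics later.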

Visualize ElastiCache and Redis/Memcached metrics together

After a few minutes, ElastiCache metrics and Redis/Memcached metrics will be accessible in Datadog for graphing, monitoring, etc.

Here’s an example of setting up a graph to combine cache hit metrics from ElastiCache with native latency metrics from Redis using the same cacheclusterid tag replica-001.

ElastiCache and Cache metrics
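The graph above boils down to two metric queries scoped by the same tag. As a rough sketch, the query strings could be assembled as shown below. The helper tagged_query is hypothetical, and redis.info.latency_ms is used here as an assumed example of a native latency metric reported by the Agent's Redis integration; substitute whichever metrics you graph.

```python
def tagged_query(metric, cache_cluster_id, aggregator="avg"):
    """Build a Datadog metric query scoped to one cache cluster."""
    return f"{aggregator}:{metric}{{cacheclusterid:{cache_cluster_id}}}"


queries = [
    tagged_query("aws.elasticache.cache_hits", "replica-001", aggregator="sum"),
    tagged_query("redis.info.latency_ms", "replica-001"),
]
for query in queries:
    print(query)
```

Because both queries filter on the same cacheclusterid value, the two series describe the same cache node even though they were collected through different paths.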

Data Collected

Metrics

aws.elasticache.bytes_read_into_memcached
(count)
Memcached - The number of bytes that have been read from the network by the cache node.
shown as byte
aws.elasticache.bytes_used_for_cache
(gauge)
Redis - The total number of bytes allocated by Redis.
shown as byte
aws.elasticache.bytes_used_for_cache_items
(gauge)
Memcached - The number of bytes used to store cache items.
shown as byte
aws.elasticache.bytes_used_for_hash
(gauge)
Memcached - The number of bytes currently used by hash tables.
shown as byte
aws.elasticache.bytes_written_out_from_memcached
(count)
Memcached - The number of bytes that have been written to the network by the cache node.
shown as byte
aws.elasticache.cache_hits
(count)
Redis - The number of successful key lookups.
shown as hit
aws.elasticache.cache_misses
(count)
Redis - The number of unsuccessful key lookups.
shown as miss
aws.elasticache.cas_badval
(count)
Memcached - The number of CAS (check and set) requests the cache has received where the CAS value did not match the stored CAS value.
shown as
aws.elasticache.cas_hits
(count)
Memcached - The number of CAS requests the cache has received where the requested key was found and the CAS value matched.
shown as hit
aws.elasticache.cas_misses
(count)
Memcached - The number of CAS requests the cache has received where the key requested was not found.
shown as miss
aws.elasticache.cmd_config_get
(count)
Memcached - The cumulative number of config get requests.
shown as get
aws.elasticache.cmd_config_set
(count)
Memcached - The cumulative number of config set requests.
shown as set
aws.elasticache.cmd_flush
(count)
Memcached - The number of flush commands the cache has received.
shown as flush
aws.elasticache.cmd_get
(count)
Memcached - The number of get commands the cache has received.
shown as get
aws.elasticache.cmd_set
(count)
Memcached - The number of set commands the cache has received.
shown as set
aws.elasticache.cmd_touch
(count)
Memcached - The cumulative number of touch requests.
shown as
aws.elasticache.cpuutilization
(gauge)
The percentage of CPU utilization.
shown as percent
aws.elasticache.curr_config
(gauge)
Memcached - The current number of configurations stored.
shown as
aws.elasticache.curr_connections
(gauge)
Redis - The number of client connections, excluding connections from read replicas. Memcached - A count of the number of connections connected to the cache at an instant in time.
shown as connection
aws.elasticache.curr_items
(gauge)
Redis - The number of items in the cache. This is derived from the Redis keyspace statistic, summing all of the keys in the entire keyspace. Memcached - A count of the number of items currently stored in the cache.
shown as item
aws.elasticache.decr_hits
(count)
Memcached - The number of decrement requests the cache has received where the requested key was found.
shown as hit
aws.elasticache.decr_misses
(count)
Memcached - The number of decrement requests the cache has received where the requested key was not found.
shown as miss
aws.elasticache.delete_hits
(count)
Memcached - The number of delete requests the cache has received where the requested key was found.
shown as hit
aws.elasticache.delete_misses
(count)
Memcached - The number of delete requests the cache has received where the requested key was not found.
shown as miss
aws.elasticache.evicted_unfetched
(count)
Memcached - The number of valid items evicted from the least recently used cache (LRU) which were never touched after being set.
shown as item
aws.elasticache.evictions
(count)
Redis - The number of keys that have been evicted due to the maxmemory limit. Memcached - The number of non-expired items the cache evicted to allow space for new writes.
shown as eviction
aws.elasticache.expired_unfetched
(count)
Memcached - The number of expired items reclaimed from the LRU which were never touched after being set.
shown as item
aws.elasticache.freeable_memory
(gauge)
The amount of free memory available on the host.
shown as byte
aws.elasticache.get_hits
(count)
Memcached - The number of get requests the cache has received where the key requested was found.
shown as hit
aws.elasticache.get_misses
(count)
Memcached - The number of get requests the cache has received where the key requested was not found.
shown as miss
aws.elasticache.get_type_cmds
(count)
Redis - The total number of get types of commands. This is derived from the Redis commandstats statistic by summing all of the get types of commands (get, mget, hget, etc.)
shown as command
aws.elasticache.hash_based_cmds
(count)
Redis - The total number of commands that are hash-based. This is derived from the Redis commandstats statistic by summing all of the commands that act upon one or more hashes.
shown as command
aws.elasticache.hyper_log_log_based_cmds
(count)
Redis - The total number of HyperLogLog based commands. This is derived from the Redis commandstats statistic by summing all of the pf type of commands (pfadd, pfcount, pfmerge).
shown as command
aws.elasticache.incr_hits
(count)
Memcached - The number of increment requests the cache has received where the key requested was found.
shown as hit
aws.elasticache.incr_misses
(count)
Memcached - The number of increment requests the cache has received where the key requested was not found.
shown as miss
aws.elasticache.key_based_cmds
(count)
Redis - The total number of commands that are key-based. This is derived from the Redis commandstats statistic by summing all of the commands that act upon one or more keys.
shown as command
aws.elasticache.list_based_cmds
(count)
Redis - The total number of commands that are list-based. This is derived from the Redis commandstats statistic by summing all of the commands that act upon one or more lists.
shown as command
aws.elasticache.network_bytes_in
(count)
The number of bytes the host has read from the network.
shown as byte
aws.elasticache.network_bytes_out
(count)
The number of bytes the host has written to the network.
shown as byte
aws.elasticache.new_connections
(count)
Redis - The total number of connections that have been accepted by the server during this period. Memcached - The number of new connections the cache has received. This is derived from the memcached total_connections statistic by recording the change in total_connections across a period of time. This will always be at least 1, due to a connection reserved for ElastiCache.
shown as connection
aws.elasticache.new_items
(count)
Memcached - The number of new items the cache has stored. This is derived from the memcached total_items statistic by recording the change in total_items across a period of time.
shown as item
aws.elasticache.reclaimed
(count)
Redis - The total number of key expiration events. Memcached - The number of expired items the cache evicted to allow space for new writes.
shown as
aws.elasticache.replication_bytes
(gauge)
Redis - For primaries with attached replicas, ReplicationBytes reports the number of bytes that the primary is sending to all of its replicas. This metric is representative of the write load on the replication group. For replicas and standalone primaries, ReplicationBytes is always 0.
shown as byte
aws.elasticache.replication_lag
(gauge)
Redis - This metric is only applicable for a cache node running as a read replica. It represents how far behind, in seconds, the replica is in applying changes from the primary cache cluster.
shown as second
aws.elasticache.save_in_progress
(gauge)
Redis - This binary metric returns 1 whenever a background save (forked or forkless) is in progress, and 0 otherwise. A background save process is typically used during snapshots and syncs. These operations can cause degraded performance. Using the SaveInProgress metric, you can diagnose whether or not degraded performance was caused by a background save process.
shown as
aws.elasticache.set_based_cmds
(count)
Redis - The total number of commands that are set-based. This is derived from the Redis commandstats statistic by summing all of the commands that act upon one or more sets.
shown as command
aws.elasticache.set_type_cmds
(count)
Redis - The total number of set types of commands. This is derived from the Redis commandstats statistic by summing all of the set types of commands (set, hset, etc.)
shown as command
aws.elasticache.slabs_moved
(count)
Memcached - The total number of slab pages that have been moved.
shown as
aws.elasticache.sorted_set_based_cmds
(count)
Redis - The total number of commands that are sorted set-based. This is derived from the Redis commandstats statistic by summing all of the commands that act upon one or more sorted sets.
shown as command
aws.elasticache.string_based_cmds
(count)
Redis - The total number of commands that are string-based. This is derived from the Redis commandstats statistic by summing all of the commands that act upon one or more strings.
shown as command
aws.elasticache.swap_usage
(gauge)
The amount of swap used on the host.
shown as byte
aws.elasticache.touch_hits
(count)
Memcached - The number of keys that have been touched and were given a new expiration time.
shown as hit
aws.elasticache.touch_misses
(count)
Memcached - The number of items that have been touched, but were not found.
shown as miss
aws.elasticache.unused_memory
(gauge)
Memcached - The amount of unused memory the cache can use to store items. This is derived from the memcached statistics limit_maxbytes and bytes by subtracting bytes from limit_maxbytes.
shown as byte
aws.elasticache.is_master
(gauge)
Redis - Returns 1 if the node is master, 0 otherwise.
shown as
aws.elasticache.geo_spatial_based_cmds
(count)
Redis - The total number of geo spatial based commands.
shown as
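Several of the metrics above are derived from raw statistics, and a few are most useful when combined. The following sketch illustrates three such calculations under simple assumptions: a hit rate computed from cache_hits and cache_misses, the unused_memory subtraction described above, and the per-period change used for counters such as new_connections. The function names are hypothetical and the numbers are made up.

```python
def hit_rate(hits, misses):
    """Fraction of lookups served from cache; None when there were no lookups."""
    total = hits + misses
    return hits / total if total else None


def unused_memory(limit_maxbytes, bytes_used):
    """Mirror the unused_memory derivation: limit_maxbytes minus bytes."""
    return limit_maxbytes - bytes_used


def deltas(samples):
    """Turn a cumulative counter (e.g. total_connections) into per-period changes."""
    return [later - earlier for earlier, later in zip(samples, samples[1:])]


print(hit_rate(90, 10))
print(unused_memory(1024, 256))
print(deltas([10, 13, 13, 20]))
```

Returning None for an idle period avoids a division by zero and keeps "no traffic" distinguishable from "every lookup missed".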