
Mesos


Overview

This check collects metrics from Mesos masters for:

  • Cluster resources
  • Slaves registered, active, inactive, connected, disconnected, etc.
  • Number of tasks failed, finished, staged, running, etc.
  • Number of frameworks active, inactive, connected, and disconnected

And many more.

Setup

Installation

The installation is the same on Mesos with and without DC/OS. Run the datadog-agent container on each of your Mesos master nodes:

docker run -d --name datadog-agent \
  -v /var/run/docker.sock:/var/run/docker.sock:ro \
  -v /proc/:/host/proc/:ro \
  -v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro \
  -e DD_API_KEY=<YOUR_DATADOG_API_KEY> \
  -e MESOS_MASTER=true \
  -e MARATHON_URL=http://leader.mesos:8080 \
  datadog/agent:latest

Substitute your Datadog API key and Mesos Master’s API URL into the command above.

Configuration

If you passed the correct Master URL when starting datadog-agent, the Agent is already using a default mesos_master.d/conf.yaml to collect metrics from your masters; you don’t need to configure anything else. See the sample mesos_master.d/conf.yaml for all available configuration options.

The one exception is when your masters’ API uses a self-signed certificate. In that case, set disable_ssl_validation: true in mesos_master.d/conf.yaml.
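To illustrate what disabling SSL validation means in practice, here is a minimal Python sketch of a client that reads a master's metrics with certificate checks optionally turned off. This is not the Agent's implementation. The helper names (`build_ssl_context`, `fetch_snapshot`) are hypothetical, and the `/metrics/snapshot` path is assumed from the standard Mesos HTTP API rather than taken from this page.

```python
import json
import ssl
import urllib.request


def build_ssl_context(disable_ssl_validation: bool) -> ssl.SSLContext:
    """Return a TLS context; optionally skip certificate verification."""
    context = ssl.create_default_context()
    if disable_ssl_validation:
        # check_hostname must be disabled before verify_mode can be CERT_NONE.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context


def fetch_snapshot(base_url: str, disable_ssl_validation: bool = False,
                   timeout: float = 5.0) -> dict:
    """Fetch the flat metrics dictionary from a Mesos master (assumed endpoint)."""
    context = build_ssl_context(disable_ssl_validation)
    url = f"{base_url}/metrics/snapshot"
    with urllib.request.urlopen(url, timeout=timeout, context=context) as response:
        return json.load(response)
```

Skipping verification accepts any certificate, so it is only appropriate when you already trust the network path to the master.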

Log Collection

Datadog Agent version 6 and greater can collect logs from containers. You can either collect all logs from all your containers, or filter by container image name or container label to choose which logs are collected.

Add these extra options to the Datadog Agent run command to start collecting logs:

  • -e DD_LOGS_ENABLED=true: enables log collection. The Agent then looks for log instructions in configuration files or container labels.
  • -e DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL=true: enables log collection for all containers.
  • -v /opt/datadog-agent/run:/opt/datadog-agent/run:rw: mounts the directory where the Agent stores a pointer for each container's logs, so it can track what has already been sent to Datadog.

This gives the following command:

docker run -d --name datadog-agent \
  -v /var/run/docker.sock:/var/run/docker.sock:ro \
  -v /proc/:/host/proc/:ro \
  -v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro \
  -v /opt/datadog-agent/run:/opt/datadog-agent/run:rw \
  -e DD_API_KEY=<YOUR_DATADOG_API_KEY> \
  -e MESOS_MASTER=true \
  -e MARATHON_URL=http://leader.mesos:8080 \
  -e DD_LOGS_ENABLED=true \
  -e DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL=true \
  datadog/agent:latest
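If you launch the Agent from a provisioning script, the two variants of the command (with and without log collection) can be generated from a single function. The sketch below is only an illustration: the function name `datadog_agent_run_args` is hypothetical, and the flags are copied from the commands on this page.

```python
from typing import List


def datadog_agent_run_args(api_key: str, collect_logs: bool = False) -> List[str]:
    """Build the docker run argument list for the Agent on a Mesos master."""
    args = [
        "docker", "run", "-d", "--name", "datadog-agent",
        "-v", "/var/run/docker.sock:/var/run/docker.sock:ro",
        "-v", "/proc/:/host/proc/:ro",
        "-v", "/sys/fs/cgroup/:/host/sys/fs/cgroup:ro",
    ]
    if collect_logs:
        # Directory where the Agent keeps its per-container log pointers.
        args += ["-v", "/opt/datadog-agent/run:/opt/datadog-agent/run:rw"]
    args += [
        "-e", f"DD_API_KEY={api_key}",
        "-e", "MESOS_MASTER=true",
        "-e", "MARATHON_URL=http://leader.mesos:8080",
    ]
    if collect_logs:
        args += [
            "-e", "DD_LOGS_ENABLED=true",
            "-e", "DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL=true",
        ]
    args.append("datadog/agent:latest")
    return args
```

The returned list can be passed to `subprocess.run(args, check=True)`. Passing a list rather than a shell string avoids quoting problems with the API key.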

Use the Autodiscovery feature for logs to override the service and source attributes, so that you benefit from the integration's automatic setup.

Validation

In the Datadog app, search for mesos.cluster in the Metrics Explorer.

Data Collected

Metrics

mesos.framework.cpu
(gauge)
Framework cpu
mesos.framework.mem
(gauge)
Framework mem
shown as mebibyte
mesos.framework.disk
(gauge)
Framework disk
shown as mebibyte
mesos.role.cpu
(gauge)
Role cpu
mesos.role.mem
(gauge)
Role mem
shown as mebibyte
mesos.role.disk
(gauge)
Role disk
shown as mebibyte
mesos.cluster.tasks_error
(gauge)
Number of tasks that were invalid
shown as task
mesos.cluster.tasks_failed
(count)
Number of failed tasks
shown as task
mesos.cluster.tasks_finished
(count)
Number of finished tasks
shown as task
mesos.cluster.tasks_killed
(count)
Number of killed tasks
shown as task
mesos.cluster.tasks_lost
(count)
Number of lost tasks
shown as task
mesos.cluster.tasks_running
(gauge)
Number of running tasks
shown as task
mesos.cluster.tasks_staging
(gauge)
Number of staging tasks
shown as task
mesos.cluster.tasks_starting
(gauge)
Number of starting tasks
shown as task
mesos.cluster.slave_registrations
(gauge)
Number of slaves that were able to cleanly re-join the cluster and connect back to the master after the master is disconnected.
mesos.cluster.slave_removals
(gauge)
Number of slaves removed for various reasons, including maintenance
mesos.cluster.slave_reregistrations
(gauge)
Number of slave re-registrations
mesos.cluster.slave_shutdowns_canceled
(gauge)
Number of cancelled slave shutdowns
mesos.cluster.slave_shutdowns_scheduled
(gauge)
Number of slaves which have failed their health check and are scheduled to be removed
mesos.cluster.slaves_active
(gauge)
Number of active slaves
mesos.cluster.slaves_connected
(gauge)
Number of connected slaves
mesos.cluster.slaves_disconnected
(gauge)
Number of disconnected slaves
mesos.cluster.slaves_inactive
(gauge)
Number of inactive slaves
mesos.cluster.cpus_percent
(gauge)
Percentage of allocated CPUs
shown as percent
mesos.cluster.cpus_used
(gauge)
Number of allocated CPUs
mesos.cluster.cpus_total
(gauge)
Number of CPUs
mesos.cluster.gpus_percent
(gauge)
Percentage of allocated GPUs
shown as percent
mesos.cluster.gpus_used
(gauge)
Number of allocated GPUs
mesos.cluster.gpus_total
(gauge)
Number of GPUs
mesos.cluster.disk_percent
(gauge)
Percentage of allocated disk space
shown as percent
mesos.cluster.disk_used
(gauge)
Allocated disk space
shown as mebibyte
mesos.cluster.disk_total
(gauge)
Disk space
shown as mebibyte
mesos.cluster.mem_percent
(gauge)
Percentage of allocated memory
shown as percent
mesos.cluster.mem_used
(gauge)
Allocated memory
shown as mebibyte
mesos.cluster.mem_total
(gauge)
Total memory
shown as mebibyte
mesos.registrar.queued_operations
(gauge)
Number of queued operations
mesos.registrar.registry_size_bytes
(gauge)
Registry size
shown as byte
mesos.registrar.state_fetch_ms
(gauge)
Registry read latency
shown as millisecond
mesos.registrar.state_store_ms
(gauge)
Registry write latency
shown as millisecond
mesos.registrar.state_store_ms.count
(gauge)
Registry write count
mesos.registrar.state_store_ms.max
(gauge)
Maximum registry write latency
shown as millisecond
mesos.registrar.state_store_ms.min
(gauge)
Minimum registry write latency
shown as millisecond
mesos.registrar.state_store_ms.p50
(gauge)
Median registry write latency
shown as millisecond
mesos.registrar.state_store_ms.p90
(gauge)
90th percentile registry write latency
shown as millisecond
mesos.registrar.state_store_ms.p95
(gauge)
95th percentile registry write latency
shown as millisecond
mesos.registrar.state_store_ms.p99
(gauge)
99th percentile registry write latency
shown as millisecond
mesos.registrar.state_store_ms.p999
(gauge)
99.9th percentile registry write latency
shown as millisecond
mesos.registrar.state_store_ms.p9999
(gauge)
99.99th percentile registry write latency
shown as millisecond
mesos.registrar.log.recovered
(gauge)
Registrar log recovered
mesos.cluster.frameworks_active
(gauge)
Number of active frameworks
mesos.cluster.frameworks_connected
(gauge)
Number of connected frameworks
mesos.cluster.frameworks_disconnected
(gauge)
Number of disconnected frameworks
mesos.cluster.frameworks_inactive
(gauge)
Number of inactive frameworks
mesos.stats.system.cpus_total
(gauge)
Number of CPUs available
mesos.stats.system.load_15min
(gauge)
Load average for the past 15 minutes
mesos.stats.system.load_1min
(gauge)
Load average for the past minute
mesos.stats.system.load_5min
(gauge)
Load average for the past 5 minutes
mesos.stats.system.mem_free_bytes
(gauge)
Free memory
shown as byte
mesos.stats.system.mem_total_bytes
(gauge)
Total memory
shown as byte
mesos.stats.elected
(gauge)
Whether this is the elected master
mesos.stats.uptime_secs
(gauge)
Uptime
shown as second
mesos.cluster.dropped_messages
(gauge)
Number of dropped messages
shown as message
mesos.cluster.outstanding_offers
(gauge)
Number of outstanding resource offers
mesos.cluster.event_queue_dispatches
(gauge)
Number of dispatches in the event queue
mesos.cluster.event_queue_http_requests
(gauge)
Number of HTTP requests in the event queue
shown as request
mesos.cluster.event_queue_messages
(gauge)
Number of messages in the event queue
shown as message
mesos.cluster.invalid_framework_to_executor_messages
(gauge)
Number of invalid framework messages
shown as message
mesos.cluster.invalid_status_update_acknowledgements
(gauge)
Number of invalid status update acknowledgements
mesos.cluster.invalid_status_updates
(gauge)
Number of invalid status updates
mesos.cluster.valid_framework_to_executor_messages
(gauge)
Number of valid framework messages
shown as message
mesos.cluster.valid_status_update_acknowledgements
(gauge)
Number of valid status update acknowledgements
mesos.cluster.valid_status_updates
(gauge)
Number of valid status updates
mesos.stats.registered
(gauge)
Whether this slave is registered with a master

Events

The Mesos-master check does not include any events at this time.

Service Checks

mesos_master.can_connect:

Returns CRITICAL if the Agent cannot connect to the Mesos Master API to collect metrics, otherwise OK.
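The logic of a connectivity service check like this one can be sketched in a few lines of Python. This is an illustration, not the check's actual source: the function name `can_connect_status` is hypothetical, and the numeric status codes are assumed to follow the usual Agent convention (0 for OK, 2 for CRITICAL).

```python
from typing import Callable

# Assumed numeric status codes for service checks.
OK = 0
CRITICAL = 2


def can_connect_status(fetch: Callable[[], object]) -> int:
    """Return CRITICAL if calling fetch() raises, otherwise OK.

    `fetch` stands in for whatever call retrieves metrics from the
    Mesos master API, for example an HTTP request.
    """
    try:
        fetch()
    except Exception:
        return CRITICAL
    return OK
```

Taking the fetch step as a callable keeps the status logic independent of the HTTP client, which makes it easy to exercise without a running master.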

Troubleshooting

Need help? Contact Datadog Support.


Mesos_slave Integration


Overview

This Agent check collects metrics from Mesos slaves for:

  • System load
  • Number of tasks failed, finished, staged, running, etc.
  • Number of executors running, terminated, etc.

And many more.

This check also creates a service check for every executor task.

Setup

Installation

Follow the instructions in our blog post to install the Datadog Agent on each Mesos agent node via the DC/OS web UI.

Configuration

DC/OS

  1. In the DC/OS web UI, click on the Universe tab. Find the datadog package and click the Install button.
  2. Click the Advanced Installation button.
  3. Enter your Datadog API Key in the first field.
  4. In the Instances field, enter the number of slave nodes in your cluster. To find this number, click the Nodes tab on the left side of the DC/OS web UI.
  5. Click Review and Install, then Install.

Marathon

If you are not using DC/OS, define the Datadog Agent application with the following JSON, either through the Marathon web UI or by posting it to the Marathon API URL. Replace <YOUR_DATADOG_API_KEY> with your API key, and set the number of instances to the number of slave nodes in your cluster. You may also need to update the Docker image to a more recent tag; the latest tags are listed on Docker Hub.

{
  "id": "/datadog-agent",
  "cmd": null,
  "cpus": 0.05,
  "mem": 256,
  "disk": 0,
  "instances": 1,
  "constraints": [["hostname","UNIQUE"],["hostname","GROUP_BY"]],
  "acceptedResourceRoles": ["slave_public","*"],
  "container": {
    "type": "DOCKER",
    "volumes": [
      {"containerPath": "/var/run/docker.sock","hostPath": "/var/run/docker.sock","mode": "RO"},
      {"containerPath": "/host/proc","hostPath": "/proc","mode": "RO"},
      {"containerPath": "/host/sys/fs/cgroup","hostPath": "/sys/fs/cgroup","mode": "RO"}
    ],
    "docker": {
      "image": "datadog/agent:latest",
      "network": "BRIDGE",
      "portMappings": [
        {"containerPort": 8125,"hostPort": 8125,"servicePort": 10000,"protocol": "udp","labels": {}},
        {"containerPort": 9001,"hostPort": 9001,"servicePort": 10001,"protocol": "tcp","labels": {}}
      ],
      "privileged": false,
      "parameters": [
        {"key": "name","value": "datadog-agent"},
        {"key": "env","value": "DD_API_KEY=<YOUR_DATADOG_API_KEY>"},
        {"key": "env","value": "MESOS_SLAVE=true"}
      ],
      "forcePullImage": false
    }
  },
  "healthChecks": [
    {
      "gracePeriodSeconds": 300,
      "intervalSeconds": 60,
      "timeoutSeconds": 20,
      "maxConsecutiveFailures": 3,
      "portIndex": 1,
      "path": "/",
      "protocol": "HTTP",
      "ignoreHttp1xx": false
    }
  ],
  "portDefinitions": [
    {"port": 10000,"protocol": "tcp","name": "default","labels": {}},
    {"port": 10001,"protocol": "tcp","labels": {}}
  ]
}
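If you post the definition to the API rather than using the web UI, the substitutions described above can be scripted. The following Python sketch makes several assumptions: `customize_app` and `post_app` are hypothetical helper names, and `/v2/apps` is assumed to be the Marathon endpoint for creating applications, so check it against your Marathon version.

```python
import copy
import json
import urllib.request

PLACEHOLDER = "<YOUR_DATADOG_API_KEY>"


def customize_app(app: dict, api_key: str, slave_count: int) -> dict:
    """Return a copy of the app definition with the key and instance count set."""
    result = copy.deepcopy(app)  # leave the caller's template untouched
    result["instances"] = slave_count
    for param in result["container"]["docker"]["parameters"]:
        param["value"] = param["value"].replace(PLACEHOLDER, api_key)
    return result


def post_app(marathon_url: str, app: dict) -> int:
    """POST the app definition to Marathon and return the HTTP status code."""
    request = urllib.request.Request(
        f"{marathon_url}/v2/apps",  # assumed endpoint
        data=json.dumps(app).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

For example, load the JSON above with `json.load`, pass it through `customize_app(app, "my-key", 5)`, and hand the result to `post_app("http://leader.mesos:8080", app)`.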

Unless you want to configure a custom mesos_slave.d/conf.yaml (for example, to set disable_ssl_validation: true), you don’t need to do anything after installing the Agent.

Validation

DC/OS

The Datadog Agent should be listed under the Services tab in the DC/OS web UI. In the Datadog app, search for mesos.slave in the Metrics Explorer.

Marathon

If you are not using DC/OS, datadog-agent should appear in the list of running applications with a healthy status. In the Datadog app, search for mesos.slave in the Metrics Explorer.

Data Collected

Metrics

mesos.stats.system.cpus_total
(gauge)
Number of CPUs available
mesos.stats.system.load_15min
(gauge)
Load average for the past 15 minutes
mesos.stats.system.load_1min
(gauge)
Load average for the past minute
mesos.stats.system.load_5min
(gauge)
Load average for the past 5 minutes
mesos.stats.system.mem_free_bytes
(gauge)
Free memory
shown as byte
mesos.stats.system.mem_total_bytes
(gauge)
Total memory
shown as byte
mesos.state.task.cpu
(gauge)
Task cpu
mesos.state.task.mem
(gauge)
Task memory
shown as mebibyte
mesos.state.task.disk
(gauge)
Task disk
shown as mebibyte
mesos.slave.tasks_failed
(count)
Number of failed tasks
shown as task
mesos.slave.tasks_finished
(count)
Number of finished tasks
shown as task
mesos.slave.tasks_killed
(count)
Number of killed tasks
shown as task
mesos.slave.tasks_lost
(count)
Number of lost tasks
shown as task
mesos.slave.tasks_running
(gauge)
Number of running tasks
shown as task
mesos.slave.tasks_staging
(gauge)
Number of staging tasks
shown as task
mesos.slave.tasks_starting
(gauge)
Number of starting tasks
shown as task
mesos.stats.registered
(gauge)
Whether this slave is registered with a master
mesos.stats.uptime_secs
(gauge)
Slave uptime
mesos.slave.cpus_percent
(gauge)
Percentage of allocated CPUs
shown as percent
mesos.slave.cpus_used
(gauge)
Number of allocated CPUs
mesos.slave.cpus_total
(gauge)
Number of CPUs
mesos.slave.gpus_percent
(gauge)
Percentage of allocated GPUs
shown as percent
mesos.slave.gpus_used
(gauge)
Number of allocated GPUs
mesos.slave.gpus_total
(gauge)
Number of GPUs
mesos.slave.disk_percent
(gauge)
Percentage of allocated disk space
shown as percent
mesos.slave.disk_used
(gauge)
Allocated disk space
shown as mebibyte
mesos.slave.disk_total
(gauge)
Disk space
shown as mebibyte
mesos.slave.mem_percent
(gauge)
Percentage of allocated memory
shown as percent
mesos.slave.mem_used
(gauge)
Allocated memory
shown as mebibyte
mesos.slave.mem_total
(gauge)
Total memory
shown as mebibyte
mesos.slave.executors_registering
(gauge)
Number of executors registering
mesos.slave.executors_running
(gauge)
Number of executors running
mesos.slave.executors_terminated
(gauge)
Number of terminated executors
mesos.slave.executors_terminating
(gauge)
Number of terminating executors
mesos.slave.frameworks_active
(gauge)
Number of active frameworks
mesos.slave.invalid_framework_messages
(gauge)
Number of invalid framework messages
shown as message
mesos.slave.invalid_status_updates
(gauge)
Number of invalid status updates
mesos.slave.recovery_errors
(gauge)
Number of errors encountered during slave recovery
shown as error
mesos.slave.valid_framework_messages
(gauge)
Number of valid framework messages
shown as message
mesos.slave.valid_status_updates
(gauge)
Number of valid status updates

Events

The Mesos-slave check does not include any events at this time.

Service Checks

mesos_slave.can_connect:

Returns CRITICAL if the Agent cannot connect to the Mesos slave metrics endpoint, otherwise OK.

<executor_task_name>.ok:

The mesos_slave check creates a service check for each executor task, giving it one of the following statuses:

Task status      Resulting service check status
TASK_STARTING    AgentCheck.OK
TASK_RUNNING     AgentCheck.OK
TASK_FINISHED    AgentCheck.OK
TASK_FAILED      AgentCheck.CRITICAL
TASK_KILLED      AgentCheck.WARNING
TASK_LOST        AgentCheck.CRITICAL
TASK_STAGING     AgentCheck.OK
TASK_ERROR       AgentCheck.CRITICAL
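The mapping above amounts to a lookup table. The Python sketch below shows one way to express it. The numeric values assigned to OK, WARNING, CRITICAL, and UNKNOWN are assumed to mirror the AgentCheck constants, and falling back to UNKNOWN for an unlisted task state is an assumption of this example, not documented behavior of the check.

```python
# Assumed numeric values mirroring AgentCheck.OK / WARNING / CRITICAL / UNKNOWN.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

TASK_STATUS_TO_SERVICE_CHECK = {
    "TASK_STARTING": OK,
    "TASK_RUNNING": OK,
    "TASK_FINISHED": OK,
    "TASK_FAILED": CRITICAL,
    "TASK_KILLED": WARNING,
    "TASK_LOST": CRITICAL,
    "TASK_STAGING": OK,
    "TASK_ERROR": CRITICAL,
}


def service_check_status(task_state: str) -> int:
    """Translate a Mesos task state into a service check status.

    Unlisted states map to UNKNOWN (an assumption of this sketch).
    """
    return TASK_STATUS_TO_SERVICE_CHECK.get(task_state, UNKNOWN)
```

TASK_KILLED is the only state that produces a WARNING, so a monitor on these service checks can tell deliberate kills apart from failures.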

Troubleshooting

Need help? Contact Datadog Support.
