Datadog-Mesos & DC/OS Slave Integration

Overview

This Agent check collects metrics from Mesos slaves for:

  • System load
  • Number of tasks failed, finished, staged, running, etc.
  • Number of executors running, terminated, etc.

And many more.

This check also creates a service check for every executor task.

Setup

Installation

Follow the instructions in our blog post to install the Datadog Agent on each Mesos agent node via the DC/OS web UI.

Configuration

DC/OS

  1. In the DC/OS web UI, click on the Universe tab. Find the datadog package and click the Install button.
  2. Click the Advanced Installation button.
  3. Enter your Datadog API Key in the first field.
  4. In the Instances field, enter the number of slave nodes in your cluster. You can determine the number of nodes by clicking the Nodes tab on the left side of the DC/OS web UI.
  5. Click Review and Install, then Install.

Marathon

If you are not using DC/OS, define the Datadog Agent application with the following JSON, either through the Marathon web UI or by posting it to the Marathon API URL. Replace DATADOGAPIKEY with your API key and set instances to the number of slave nodes in your cluster. You may also need to update the Docker image to a more recent tag; you can find the latest on Docker Hub.

{
  "id": "/datadog-agent",
  "cmd": null,
  "cpus": 0.05,
  "mem": 256,
  "disk": 0,
  "instances": 1,
  "constraints": [["hostname","UNIQUE"],["hostname","GROUP_BY"]],
  "acceptedResourceRoles": ["slave_public","*"],
  "container": {
    "type": "DOCKER",
    "volumes": [
      {"containerPath": "/var/run/docker.sock","hostPath": "/var/run/docker.sock","mode": "RO"},
      {"containerPath": "/host/proc","hostPath": "/proc","mode": "RO"},
      {"containerPath": "/host/sys/fs/cgroup","hostPath": "/sys/fs/cgroup","mode": "RO"}
    ],
    "docker": {
      "image": "datadog/docker-dd-agent:11.0.5160",
      "network": "BRIDGE",
      "portMappings": [
        {"containerPort": 8125,"hostPort": 8125,"servicePort": 10000,"protocol": "udp","labels": {}},
        {"containerPort": 9001,"hostPort": 9001,"servicePort": 10001,"protocol": "tcp","labels": {}}
      ],
      "privileged": false,
      "parameters": [
        {"key": "name","value": "dd-agent"},
        {"key": "env","value": "API_KEY=DATADOGAPIKEY"},
        {"key": "env","value": "MESOS_SLAVE=true"},
        {"key": "env","value": "SD_BACKEND=docker"}
      ],
      "forcePullImage": false
    }
  },
  "healthChecks": [
    {
      "gracePeriodSeconds": 300,
      "intervalSeconds": 60,
      "timeoutSeconds": 20,
      "maxConsecutiveFailures": 3,
      "portIndex": 1,
      "path": "/",
      "protocol": "HTTP",
      "ignoreHttp1xx": false
    }
  ],
  "portDefinitions": [
    {"port": 10000,"protocol": "tcp","name": "default","labels": {}},
    {"port": 10001,"protocol": "tcp","labels": {}}
  ]
}
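As a rough illustration of the "post to the API URL" route, the sketch below builds the HTTP request for Marathon's POST /v2/apps endpoint. The function name build_marathon_request and the Marathon address used in the note that follows are hypothetical; only the JSON fields come from the definition above.

```python
import copy
import json
import urllib.request


def build_marathon_request(marathon_url, app_definition, api_key, instances):
    """Return a POST request that would create the Datadog Agent app in Marathon.

    Hypothetical helper for illustration; it only builds the request and
    does not send it.
    """
    app = copy.deepcopy(app_definition)  # leave the caller's dict untouched
    app["instances"] = instances  # one Agent per slave node
    # Swap the DATADOGAPIKEY placeholder for the real API key.
    for param in app["container"]["docker"]["parameters"]:
        if param["key"] == "env" and param["value"].startswith("API_KEY="):
            param["value"] = "API_KEY=" + api_key
    return urllib.request.Request(
        url=marathon_url.rstrip("/") + "/v2/apps",
        data=json.dumps(app).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To submit the application, load the JSON above into a dict, build the request against your own Marathon address (for example http://marathon.example.com:8080, a placeholder), and pass it to urllib.request.urlopen.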

Unless you want to configure a custom mesos_slave.yaml (for example, to set disable_ssl_validation: true), you don’t need to do anything after installing the Agent.

Validation

DC/OS

Under the Services tab in the DC/OS web UI, you should see the Datadog Agent listed. In the Datadog app, search for mesos.slave in the Metrics Explorer.

Marathon

If you are not using DC/OS, datadog-agent should appear in the list of running applications with a healthy status. In the Datadog app, search for mesos.slave in the Metrics Explorer.
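The same check can be scripted against Marathon's GET /v2/apps/datadog-agent endpoint. The sketch below is one possible reading of "healthy", assuming the response carries the instances, tasksRunning, tasksHealthy, and tasksUnhealthy counters under an app key; the function name is hypothetical.

```python
def agent_app_is_healthy(payload):
    """Return True when every requested Agent instance is running and healthy.

    `payload` is the decoded JSON body of GET /v2/apps/datadog-agent
    (field names are assumed from Marathon's app status response).
    """
    app = payload.get("app", {})
    wanted = app.get("instances", 0)
    return (
        wanted > 0
        and app.get("tasksRunning", 0) >= wanted
        and app.get("tasksHealthy", 0) >= wanted
        and app.get("tasksUnhealthy", 0) == 0
    )
```

Requiring tasksHealthy to match instances is stricter than only checking that the app exists, which helps catch a cluster where some slave nodes have no Agent.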

Compatibility

The mesos_slave check is compatible with all major platforms.

Data Collected

Metrics

  • mesos.stats.system.cpus_total (gauge): Number of CPUs available.
  • mesos.stats.system.load_15min (gauge): Load average for the past 15 minutes.
  • mesos.stats.system.load_1min (gauge): Load average for the past minute.
  • mesos.stats.system.load_5min (gauge): Load average for the past 5 minutes.
  • mesos.stats.system.mem_free_bytes (gauge): Free memory. Shown as byte.
  • mesos.stats.system.mem_total_bytes (gauge): Total memory. Shown as byte.
  • mesos.state.task.cpu (gauge): Task CPU.
  • mesos.state.task.mem (gauge): Task memory. Shown as mebibyte.
  • mesos.state.task.disk (gauge): Task disk. Shown as mebibyte.
  • mesos.slave.tasks_failed (count): Number of failed tasks. Shown as task.
  • mesos.slave.tasks_finished (count): Number of finished tasks. Shown as task.
  • mesos.slave.tasks_killed (count): Number of killed tasks. Shown as task.
  • mesos.slave.tasks_lost (count): Number of lost tasks. Shown as task.
  • mesos.slave.tasks_running (gauge): Number of running tasks. Shown as task.
  • mesos.slave.tasks_staging (gauge): Number of staging tasks. Shown as task.
  • mesos.slave.tasks_starting (gauge): Number of starting tasks. Shown as task.
  • mesos.stats.registered (gauge): Whether this slave is registered with a master.
  • mesos.stats.uptime_secs (gauge): Slave uptime.
  • mesos.slave.cpus_percent (gauge): Percentage of allocated CPUs. Shown as percent.
  • mesos.slave.cpus_used (gauge): Number of allocated CPUs.
  • mesos.slave.cpus_total (gauge): Number of CPUs.
  • mesos.slave.gpus_percent (gauge): Percentage of allocated GPUs. Shown as percent.
  • mesos.slave.gpus_used (gauge): Number of allocated GPUs.
  • mesos.slave.gpus_total (gauge): Number of GPUs.
  • mesos.slave.disk_percent (gauge): Percentage of allocated disk space. Shown as percent.
  • mesos.slave.disk_used (gauge): Allocated disk space. Shown as mebibyte.
  • mesos.slave.disk_total (gauge): Disk space. Shown as mebibyte.
  • mesos.slave.mem_percent (gauge): Percentage of allocated memory. Shown as percent.
  • mesos.slave.mem_used (gauge): Allocated memory. Shown as mebibyte.
  • mesos.slave.mem_total (gauge): Total memory. Shown as mebibyte.
  • mesos.slave.executors_registering (gauge): Number of executors registering.
  • mesos.slave.executors_running (gauge): Number of executors running.
  • mesos.slave.executors_terminated (gauge): Number of terminated executors.
  • mesos.slave.executors_terminating (gauge): Number of terminating executors.
  • mesos.slave.frameworks_active (gauge): Number of active frameworks.
  • mesos.slave.invalid_framework_messages (gauge): Number of invalid framework messages. Shown as message.
  • mesos.slave.invalid_status_updates (gauge): Number of invalid status updates.
  • mesos.slave.recovery_errors (gauge): Number of errors encountered during slave recovery. Shown as error.
  • mesos.slave.valid_framework_messages (gauge): Number of valid framework messages. Shown as message.
  • mesos.slave.valid_status_updates (gauge): Number of valid status updates.
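To make the naming concrete, the sketch below translates a few keys from a Mesos slave metrics snapshot into the Datadog metric names listed above. The snapshot key names (such as slave/tasks_running) and the key-to-metric pairing are assumptions for illustration, not the check's actual mapping table, and only a handful of metrics are shown.

```python
# Assumed pairing of Mesos snapshot keys to Datadog metric names.
# Illustrative subset only; the real check may map these differently.
ASSUMED_KEY_TO_METRIC = {
    "system/load_1min": "mesos.stats.system.load_1min",
    "system/mem_free_bytes": "mesos.stats.system.mem_free_bytes",
    "slave/tasks_running": "mesos.slave.tasks_running",
    "slave/mem_percent": "mesos.slave.mem_percent",
    "slave/registered": "mesos.stats.registered",
}


def translate_snapshot(snapshot):
    """Return {datadog_metric_name: value} for the snapshot keys we recognize."""
    return {
        metric: snapshot[key]
        for key, metric in ASSUMED_KEY_TO_METRIC.items()
        if key in snapshot
    }
```

Keys that are absent from the snapshot are skipped rather than reported as zero, so a missing metric stays distinguishable from a metric whose value is 0.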

Events

The Mesos-slave check does not include any events at this time.

Service Check

mesos_slave.can_connect:

Returns CRITICAL if the Agent cannot connect to the Mesos slave metrics endpoint, otherwise OK.
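The rule above can be sketched as follows. This is not the Agent's implementation: the helper names are hypothetical, the status values 0 and 2 mirror AgentCheck.OK and AgentCheck.CRITICAL, and treating a non-JSON response as a failed connection is an assumption of this sketch.

```python
import json
import urllib.request

OK, CRITICAL = 0, 2  # same numeric values as AgentCheck.OK / AgentCheck.CRITICAL


def http_get(url, timeout=5.0):
    """Fetch the raw response body from the slave metrics endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def can_connect(url, fetch=http_get):
    """Return CRITICAL if the endpoint cannot be reached or parsed, otherwise OK."""
    try:
        json.loads(fetch(url))
    except (OSError, ValueError):
        # OSError covers URLError, refused connections, and timeouts;
        # ValueError covers a body that is not valid JSON.
        return CRITICAL
    return OK
```

Passing the fetch function as a parameter keeps the decision logic testable without a running Mesos slave.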

<executor_task_name>.ok:

The mesos_slave check creates a service check for each executor task, giving it one of the following statuses:

Task status      Resultant service check status
TASK_STARTING    AgentCheck.OK
TASK_RUNNING     AgentCheck.OK
TASK_FINISHED    AgentCheck.OK
TASK_FAILED      AgentCheck.CRITICAL
TASK_KILLED      AgentCheck.WARNING
TASK_LOST        AgentCheck.CRITICAL
TASK_STAGING     AgentCheck.OK
TASK_ERROR       AgentCheck.CRITICAL
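The status table above amounts to a simple lookup, sketched below. The constants mirror the numeric values of AgentCheck.OK, WARNING, CRITICAL, and UNKNOWN; falling back to UNKNOWN for a task state that is not in the table is an assumption of this sketch, not documented behavior.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # AgentCheck status values

TASK_STATUS_TO_SERVICE_CHECK = {
    "TASK_STARTING": OK,
    "TASK_RUNNING": OK,
    "TASK_FINISHED": OK,
    "TASK_FAILED": CRITICAL,
    "TASK_KILLED": WARNING,
    "TASK_LOST": CRITICAL,
    "TASK_STAGING": OK,
    "TASK_ERROR": CRITICAL,
}


def service_check_status(task_state):
    """Return the service check status for a Mesos task state."""
    # Assumption: states missing from the table are reported as UNKNOWN.
    return TASK_STATUS_TO_SERVICE_CHECK.get(task_state, UNKNOWN)
```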

Troubleshooting

Need help? Contact Datadog Support.

Further Reading