Datadog-Mesos & DC/OS Master Integration

Overview

This check collects metrics from Mesos masters for:

  • Cluster resources
  • Slaves registered, active, inactive, connected, disconnected, etc.
  • Number of tasks failed, finished, staged, running, etc.
  • Number of frameworks active, inactive, connected, and disconnected

It also collects many more metrics; see the Metrics section for the full list.

Setup

Installation

The installation is the same on Mesos with and without DC/OS. Run the docker-dd-agent container on each of your Mesos master nodes:

docker run -d --name dd-agent \
  -v /var/run/docker.sock:/var/run/docker.sock:ro \
  -v /proc/:/host/proc/:ro \
  -v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro \
  -e API_KEY=<YOUR_DATADOG_API_KEY> \
  -e MESOS_MASTER=yes \
  -e MARATHON_URL=http://leader.mesos:8080 \
  -e SD_BACKEND=docker \
  datadog/docker-dd-agent:latest

Substitute your Datadog API key and Mesos Master’s API URL into the command above.

Configuration

If you passed the correct Master URL when starting docker-dd-agent, the Agent is already using a default mesos_master.yaml to collect metrics from your masters; you don’t need to configure anything else. See the sample mesos_master.yaml for all available configuration options.

The one exception is a masters’ API that uses a self-signed certificate. In that case, set disable_ssl_validation: true in mesos_master.yaml.
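As a minimal sketch of the self-signed certificate case, the snippet below writes a mesos_master.yaml with disable_ssl_validation enabled. The instance layout, the https://localhost:5050 address, and the output path are assumptions for illustration; check the sample mesos_master.yaml for the options your Agent version supports.

```shell
# Hypothetical output path; copy the result into the Agent's conf.d directory.
CONF_FILE="${CONF_FILE:-./mesos_master.yaml}"

# Assumed layout: a single instance pointing at a master on port 5050,
# with certificate validation turned off for a self-signed certificate.
cat > "$CONF_FILE" <<'EOF'
init_config:

instances:
  - url: https://localhost:5050
    disable_ssl_validation: true
EOF
```

Disabling validation removes protection against an impersonated master, so it is best limited to clusters where the certificate genuinely cannot be verified.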

Validation

In the Datadog app, search for mesos.cluster in the Metrics Explorer.
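If nothing shows up in the Metrics Explorer, it can help to confirm that the master itself is serving metrics. The sketch below assumes the master listens on port 5050 and returns its /metrics/snapshot endpoint as a flat JSON object; list_master_keys is a hypothetical helper name, not part of the Agent.

```shell
# List the metric keys under master/ from a Mesos metrics snapshot on stdin.
# Assumes a flat JSON object such as {"master/cpus_total":4.0,...}.
list_master_keys() {
  # Put each key/value pair on its own line, then print only master/ keys.
  tr ',{}' '\n\n\n' | sed -n 's/^[[:space:]]*"\(master\/[^"]*\)".*/\1/p'
}

# Example (run on a master node; host and port are assumptions):
#   curl -s http://localhost:5050/metrics/snapshot | list_master_keys
```

An empty result suggests the problem is on the Mesos side rather than in the Agent.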

Compatibility

The mesos_master check is compatible with all major platforms.

Data Collected

Metrics

  • mesos.framework.cpu (gauge): Framework CPU.
  • mesos.framework.mem (gauge): Framework memory. Shown as mebibyte.
  • mesos.framework.disk (gauge): Framework disk. Shown as mebibyte.
  • mesos.role.cpu (gauge): Role CPU.
  • mesos.role.mem (gauge): Role memory. Shown as mebibyte.
  • mesos.role.disk (gauge): Role disk. Shown as mebibyte.
  • mesos.cluster.tasks_error (gauge): Number of tasks that were invalid. Shown as task.
  • mesos.cluster.tasks_failed (count): Number of failed tasks. Shown as task.
  • mesos.cluster.tasks_finished (count): Number of finished tasks. Shown as task.
  • mesos.cluster.tasks_killed (count): Number of killed tasks. Shown as task.
  • mesos.cluster.tasks_lost (count): Number of lost tasks. Shown as task.
  • mesos.cluster.tasks_running (gauge): Number of running tasks. Shown as task.
  • mesos.cluster.tasks_staging (gauge): Number of staging tasks. Shown as task.
  • mesos.cluster.tasks_starting (gauge): Number of starting tasks. Shown as task.
  • mesos.cluster.slave_registrations (gauge): Number of slaves that were able to cleanly re-join the cluster and connect back to the master after the master is disconnected.
  • mesos.cluster.slave_removals (gauge): Number of slaves removed for various reasons, including maintenance.
  • mesos.cluster.slave_reregistrations (gauge): Number of slave re-registrations.
  • mesos.cluster.slave_shutdowns_canceled (gauge): Number of canceled slave shutdowns.
  • mesos.cluster.slave_shutdowns_scheduled (gauge): Number of slaves that have failed their health check and are scheduled to be removed.
  • mesos.cluster.slaves_active (gauge): Number of active slaves.
  • mesos.cluster.slaves_connected (gauge): Number of connected slaves.
  • mesos.cluster.slaves_disconnected (gauge): Number of disconnected slaves.
  • mesos.cluster.slaves_inactive (gauge): Number of inactive slaves.
  • mesos.cluster.cpus_percent (gauge): Percentage of allocated CPUs. Shown as percent.
  • mesos.cluster.cpus_used (gauge): Number of allocated CPUs.
  • mesos.cluster.cpus_total (gauge): Number of CPUs.
  • mesos.cluster.gpus_percent (gauge): Percentage of allocated GPUs. Shown as percent.
  • mesos.cluster.gpus_used (gauge): Number of allocated GPUs.
  • mesos.cluster.gpus_total (gauge): Number of GPUs.
  • mesos.cluster.disk_percent (gauge): Percentage of allocated disk space. Shown as percent.
  • mesos.cluster.disk_used (gauge): Allocated disk space. Shown as mebibyte.
  • mesos.cluster.disk_total (gauge): Disk space. Shown as mebibyte.
  • mesos.cluster.mem_percent (gauge): Percentage of allocated memory. Shown as percent.
  • mesos.cluster.mem_used (gauge): Allocated memory. Shown as mebibyte.
  • mesos.cluster.mem_total (gauge): Total memory. Shown as mebibyte.
  • mesos.registrar.queued_operations (gauge): Number of queued operations.
  • mesos.registrar.registry_size_bytes (gauge): Registry size. Shown as byte.
  • mesos.registrar.state_fetch_ms (gauge): Registry read latency. Shown as millisecond.
  • mesos.registrar.state_store_ms (gauge): Registry write latency. Shown as millisecond.
  • mesos.registrar.state_store_ms.count (gauge): Registry write count.
  • mesos.registrar.state_store_ms.max (gauge): Maximum registry write latency. Shown as millisecond.
  • mesos.registrar.state_store_ms.min (gauge): Minimum registry write latency. Shown as millisecond.
  • mesos.registrar.state_store_ms.p50 (gauge): Median registry write latency. Shown as millisecond.
  • mesos.registrar.state_store_ms.p90 (gauge): 90th percentile registry write latency. Shown as millisecond.
  • mesos.registrar.state_store_ms.p95 (gauge): 95th percentile registry write latency. Shown as millisecond.
  • mesos.registrar.state_store_ms.p99 (gauge): 99th percentile registry write latency. Shown as millisecond.
  • mesos.registrar.state_store_ms.p999 (gauge): 99.9th percentile registry write latency. Shown as millisecond.
  • mesos.registrar.state_store_ms.p9999 (gauge): 99.99th percentile registry write latency. Shown as millisecond.
  • mesos.registrar.log.recovered (gauge): Registrar log recovered.
  • mesos.cluster.frameworks_active (gauge): Number of active frameworks.
  • mesos.cluster.frameworks_connected (gauge): Number of connected frameworks.
  • mesos.cluster.frameworks_disconnected (gauge): Number of disconnected frameworks.
  • mesos.cluster.frameworks_inactive (gauge): Number of inactive frameworks.
  • mesos.stats.system.cpus_total (gauge): Number of CPUs available.
  • mesos.stats.system.load_15min (gauge): Load average for the past 15 minutes.
  • mesos.stats.system.load_1min (gauge): Load average for the past minute.
  • mesos.stats.system.load_5min (gauge): Load average for the past 5 minutes.
  • mesos.stats.system.mem_free_bytes (gauge): Free memory. Shown as byte.
  • mesos.stats.system.mem_total_bytes (gauge): Total memory. Shown as byte.
  • mesos.stats.elected (gauge): Whether this is the elected master.
  • mesos.stats.uptime_secs (gauge): Uptime. Shown as second.
  • mesos.cluster.dropped_messages (gauge): Number of dropped messages. Shown as message.
  • mesos.cluster.outstanding_offers (gauge): Number of outstanding resource offers.
  • mesos.cluster.event_queue_dispatches (gauge): Number of dispatches in the event queue.
  • mesos.cluster.event_queue_http_requests (gauge): Number of HTTP requests in the event queue. Shown as request.
  • mesos.cluster.event_queue_messages (gauge): Number of messages in the event queue. Shown as message.
  • mesos.cluster.invalid_framework_to_executor_messages (gauge): Number of invalid framework messages. Shown as message.
  • mesos.cluster.invalid_status_update_acknowledgements (gauge): Number of invalid status update acknowledgements.
  • mesos.cluster.invalid_status_updates (gauge): Number of invalid status updates.
  • mesos.cluster.valid_framework_to_executor_messages (gauge): Number of valid framework messages. Shown as message.
  • mesos.cluster.valid_status_update_acknowledgements (gauge): Number of valid status update acknowledgements.
  • mesos.cluster.valid_status_updates (gauge): Number of valid status updates.
  • mesos.stats.registered (gauge): Whether this slave is registered with a master.

Events

The Mesos-master check does not include any events.

Service Checks

mesos_master.can_connect:

Returns CRITICAL if the Agent cannot connect to the Mesos Master API to collect metrics; otherwise, returns OK.
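The behavior described above can be approximated by hand when debugging connectivity from a master node. This is only a sketch of the described outcome, not the Agent's implementation; mesos_master_can_connect is a hypothetical function name, and the URL in the example is an assumption.

```shell
# Print OK if the URL answers successfully within the timeout, else CRITICAL.
# Usage: mesos_master_can_connect URL [TIMEOUT_SECONDS]
mesos_master_can_connect() {
  url="$1"
  timeout="${2:-5}"
  # --fail makes curl exit non-zero on HTTP errors (4xx/5xx) as well as
  # on connection failures and timeouts.
  if curl --silent --fail --max-time "$timeout" --output /dev/null "$url"; then
    echo "OK"
  else
    echo "CRITICAL"
  fi
}

# Example (host, port, and path are assumptions for a local master):
#   mesos_master_can_connect http://localhost:5050/metrics/snapshot
```

If this prints CRITICAL from the host where the Agent runs, check the master URL, firewall rules, and certificate settings before investigating the Agent itself.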

Troubleshooting

Need help? Contact Datadog Support.

Further Reading