Marathon

Agent Check

Supported OS: Linux, Mac OS

Overview

The Agent’s Marathon check lets you:

  • Track the state and health of every application: see configured memory, disk, cpu, and instances; monitor the number of healthy and unhealthy tasks
  • Monitor the number of queued applications and the number of deployments

Setup

The instructions below install and configure the check for an Agent running on a host. To apply them to a containerized environment, see the Autodiscovery Integration Templates documentation.

Installation

The Marathon check is included in the Datadog Agent package, so you don’t need to install anything else on your Marathon master.

Configuration

  1. Edit the marathon.d/conf.yaml file, in the conf.d/ folder at the root of your Agent’s configuration directory. See the sample marathon.d/conf.yaml for all available configuration options:

        init_config:
    
        instances:
          - url: https://<server>:<port> # the API endpoint of your Marathon master; required
        #   acs_url: https://<server>:<port> # if your Marathon master requires ACS auth
            user: <username> # the user for marathon API or ACS token authentication
            password: <password> # the password for marathon API or ACS token authentication

    The function of user and password depends on whether you configure acs_url. If you do, the Agent uses them to request an authentication token from ACS, which it then uses to authenticate to the Marathon API. Otherwise, the Agent uses user and password to authenticate directly to the Marathon API.

  2. Restart the Agent to begin sending Marathon metrics to Datadog.
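The way user and password are interpreted in step 1 can be summarized as a small decision function. This is a minimal sketch, not the Agent's actual code: the function name `auth_mode` and its return labels are hypothetical, and only the keys `acs_url`, `user`, and `password` come from the sample configuration.

```python
def auth_mode(instance):
    """Classify how the credentials in one `instances` entry would be used.

    Hypothetical helper for illustration; not part of the Datadog Agent.
    """
    if not (instance.get("user") and instance.get("password")):
        # Nothing to authenticate with.
        return "no_credentials"
    if instance.get("acs_url"):
        # Credentials go to ACS first; the resulting token is sent to Marathon.
        return "acs_token"
    # Credentials are sent directly to the Marathon API.
    return "direct"


# Hypothetical instance entries mirroring the sample conf.yaml.
direct = {"url": "https://marathon.example:8080", "user": "u", "password": "p"}
via_acs = dict(direct, acs_url="https://acs.example:443")
print(auth_mode(direct), auth_mode(via_acs))
```

Keeping the decision in one place makes it easy to see that adding or removing `acs_url` is the only thing that changes what the same two credentials are used for.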

Log Collection

Available for Agent >6.0

  1. Log collection is disabled by default in the Datadog Agent. Enable it in your datadog.yaml file:

    logs_enabled: true
  2. Because Marathon uses logback, you can specify a custom log format. Datadog supports two formats out of the box: the default one provided by Marathon and the Datadog recommended format. Add a file appender to your configuration, as in the following example, and replace $PATTERN$ with your selected format:

    • Marathon default: [%date] %-5level %message \(%logger:%thread\)%n
    • Datadog recommended: %d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n

    <?xml version="1.0" encoding="UTF-8"?>
    
    <configuration>
        <shutdownHook class="ch.qos.logback.core.hook.DelayingShutdownHook"/>
        <appender name="stdout" class="ch.qos.logback.core.ConsoleAppender">
            <encoder>
                <pattern>[%date] %-5level %message \(%logger:%thread\)%n</pattern>
            </encoder>
        </appender>
        <appender name="async" class="ch.qos.logback.classic.AsyncAppender">
            <appender-ref ref="stdout" />
            <queueSize>1024</queueSize>
        </appender>
        <appender name="FILE" class="ch.qos.logback.core.FileAppender">
            <file>/var/log/marathon.log</file>
            <append>true</append>
            <!-- set immediateFlush to false for much higher logging throughput -->
            <immediateFlush>true</immediateFlush>
            <encoder>
                <pattern>$PATTERN$</pattern>
            </encoder>
        </appender>
        <root level="INFO">
            <appender-ref ref="async"/>
            <appender-ref ref="FILE"/>
        </root>
    </configuration>
  3. Add this configuration block to your marathon.d/conf.yaml file to start collecting your Marathon logs:

      logs:
        - type: file
          path: /var/log/marathon.log
          source: marathon
          service: <SERVICE_NAME>
  4. Restart the Agent.
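To check that the file appender from step 2 writes lines in the Datadog recommended format before relying on them, it can help to parse a few lines by hand. The sketch below is an illustration only: the regular expression is a hand-written approximation of that pattern, the sample line is invented, and it assumes thread names contain no `]` and logger names contain no spaces.

```python
import re

# Approximates: %d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n
LINE_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"\[(?P<thread>[^\]]+)\] "
    r"(?P<level>[A-Z]+)\s+"  # %-5level pads short levels such as INFO with spaces
    r"(?P<logger>\S+) - "
    r"(?P<message>.*)$"
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not fit the pattern."""
    match = LINE_RE.match(line.rstrip("\n"))
    return match.groupdict() if match else None


# Hypothetical sample line, written by hand to fit the pattern.
sample = "2024-01-15 10:30:45.123 [main] INFO  example.Logger - Starting up\n"
print(parse_line(sample))
```

A `None` result for lines in /var/log/marathon.log would suggest that $PATTERN$ was not replaced with the intended format.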

Validation

Run the Agent’s status subcommand and look for marathon under the Checks section.

Data Collected

Metrics

marathon.apps
(gauge)
Number of applications
marathon.deployments
(gauge)
Number of running or pending deployments
marathon.backoffFactor
(gauge)
Backoff time multiplication factor for each consecutive failed task launch; tagged by app_id and version
marathon.backoffSeconds
(gauge)
Task backoff period; tagged by app_id and version
shown as second
marathon.cpus
(gauge)
Configured CPUs for each instance of a given application
marathon.disk
(gauge)
Configured disk space for each instance of a given application
shown as mebibyte
marathon.instances
(gauge)
Number of instances of a given application; tagged by app_id and version
marathon.mem
(gauge)
Configured memory for each instance of a given application; tagged by app_id and version
shown as mebibyte
marathon.taskRateLimit
(gauge)
The task rate limit for a given application; tagged by app_id and version
marathon.tasksRunning
(gauge)
Number of tasks running for a given application; tagged by app_id and version
shown as task
marathon.tasksStaged
(gauge)
Number of tasks staged for a given application; tagged by app_id and version
shown as task
marathon.tasksHealthy
(gauge)
Number of healthy tasks for a given application; tagged by app_id and version
shown as task
marathon.tasksUnhealthy
(gauge)
Number of unhealthy tasks for a given application; tagged by app_id and version
shown as task
marathon.queue.size
(gauge)
Number of app offer queues
shown as task
marathon.queue.count
(gauge)
Number of instances left to launch
shown as task
marathon.queue.delay
(gauge)
Wait before the next launch attempt
shown as second
marathon.queue.offers.processed
(gauge)
The number of processed offers for this launch attempt
shown as task
marathon.queue.offers.unused
(gauge)
The number of unused offers for this launch attempt
shown as task
marathon.queue.offers.reject.last
(gauge)
Summary of unused offers for all last offers
shown as task
marathon.queue.offers.reject.launch
(gauge)
Summary of unused offers for the launch attempt
shown as task
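The per-application gauges above share a simple shape: one value per application, tagged by app_id and version. The following sketch shows that shape under stated assumptions: it assumes an application record is a dictionary whose keys match the metric suffixes, and the `app_id:<id>` and `version:<version>` tag strings, the `app_gauges` helper, and the sample record are all hypothetical rather than taken from the Agent.

```python
# Assumed to match the metric suffixes listed above (marathon.<field>).
APP_FIELDS = [
    "backoffFactor", "backoffSeconds", "cpus", "disk", "instances", "mem",
    "taskRateLimit", "tasksRunning", "tasksStaged", "tasksHealthy",
    "tasksUnhealthy",
]


def app_gauges(app):
    """Turn one application record into (metric_name, value, tags) tuples."""
    tags = ["app_id:" + str(app.get("id")), "version:" + str(app.get("version"))]
    return [
        ("marathon." + field, app[field], tags)
        for field in APP_FIELDS
        if field in app  # skip fields the record does not report
    ]


# Hypothetical application record for illustration.
sample_app = {
    "id": "/my-app",
    "version": "2024-01-15T10:30:45.123Z",
    "cpus": 0.5,
    "mem": 256,
    "tasksRunning": 3,
}
for name, value, tags in app_gauges(sample_app):
    print(name, value, tags)
```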

Events

The Marathon check does not include any events.

Service Checks

marathon.can_connect:

Returns CRITICAL if the Agent cannot connect to the Marathon API to collect metrics, otherwise OK.
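The same OK or CRITICAL distinction can be reproduced by hand when debugging connectivity from the Agent host. This is a rough sketch rather than the check's implementation: it assumes an unauthenticated request to the Marathon `/v2/apps` endpoint is a reasonable probe, and the function name `can_connect` and the 5-second default timeout are illustrative choices.

```python
import urllib.error
import urllib.request


def can_connect(base_url, timeout=5.0):
    """Return "OK" if the Marathon API answers with a 2xx status, else "CRITICAL"."""
    probe_url = base_url.rstrip("/") + "/v2/apps"
    try:
        with urllib.request.urlopen(probe_url, timeout=timeout) as response:
            return "OK" if 200 <= response.status < 300 else "CRITICAL"
    except (urllib.error.URLError, OSError, ValueError):
        # Covers refused connections, timeouts, HTTP error statuses,
        # and malformed URLs.
        return "CRITICAL"


# Nothing is expected to listen on this port, so the probe should fail.
print(can_connect("http://127.0.0.1:1", timeout=1.0))
```

If your Marathon master requires authentication, an unauthenticated probe like this one reports CRITICAL even when the network path is fine, so treat the result as a connectivity hint only.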

Troubleshooting

Need help? Contact Datadog support.

