Processes

Agent Check

Supported OS: Linux, Mac OS, Windows

Overview

The process check lets you:

  • Collect resource usage metrics for specific running processes on any host: CPU, memory, I/O, number of threads, etc.
  • Use Process Monitors: configure thresholds for how many instances of a specific process ought to be running and get alerts when the thresholds aren’t met (see Service Checks below).

Setup

Installation

The process check is included in the Datadog Agent package, so you don’t need to install anything else on your server.

Configuration

Unlike many checks, the process check doesn’t monitor anything useful by default; you must tell it which processes you want to monitor, and how.

While there’s no standard default check configuration, here’s an example process.d/conf.yaml that monitors ssh/sshd processes. See the sample process.d/conf.yaml for all available configuration options:

init_config:

instances:
  - name: ssh
    search_string: ['ssh', 'sshd']

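# To search for sshd processes by their full command line
# (exact_match must be False so that the check looks at the command line
# instead of the process name)
# - name: ssh
#   search_string: ['/usr/sbin/sshd -D']
#   exact_match: False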

The process check uses the psutil Python package to inspect processes on your machine. By default, the check matches exactly and looks at process names only. If you set exact_match to False in your YAML file, the Agent instead looks at the command used to launch each process and matches every process whose command contains one of your keywords.
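
The two matching modes described above can be sketched in a few lines of Python. This is an illustration only, not the Agent's actual code: the function name matches and its arguments are hypothetical, and the process name and command line are passed in directly rather than read through psutil.

```python
def matches(name, cmdline, search_strings, exact_match=True):
    """Return True if a process matches any of the search strings.

    name: the process name, e.g. "sshd"
    cmdline: the launch command as a list, e.g. ["/usr/sbin/sshd", "-D"]
    (Hypothetical helper; both values are supplied by the caller.)
    """
    if exact_match:
        # Exact match: compare each keyword against the process name only.
        return any(s == name for s in search_strings)
    # Otherwise: look for each keyword anywhere in the launch command.
    command = " ".join(cmdline)
    return any(s in command for s in search_strings)
```

Under this sketch, the keyword 'ssh' does not match a process named sshd while exact_match is True, but it does match once exact_match is False, because 'ssh' appears in the command /usr/sbin/sshd -D.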

You can also configure the check to find any process by exact PID (pid) or pidfile (pid_file). If you provide more than one of search_string, pid, and pid_file, the check uses the first option it finds in that order (e.g. it uses search_string over pid_file if you configure both).
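
As a rough sketch of that precedence rule, assuming an instance is represented as a plain Python dict (the names LOOKUP_ORDER and pick_lookup are hypothetical and do not come from the Agent):

```python
# Order in which the lookup options are considered, per the text above.
LOOKUP_ORDER = ("search_string", "pid", "pid_file")


def pick_lookup(instance):
    """Return the name of the first configured lookup option, or None."""
    for option in LOOKUP_ORDER:
        if instance.get(option) is not None:
            return option
    return None
```

For example, an instance that sets both search_string and pid_file resolves to search_string, and one that sets pid and pid_file resolves to pid.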

To have the check search for processes in a path other than /proc, set procfs_path: <your_proc_path> in datadog.conf, NOT in process.yaml (its use has been deprecated there). Set this to /host/proc if you’re running the Agent from a Docker container (i.e. docker-dd-agent) and want to monitor processes running on the server hosting your containers. You DON’T need to set this to monitor processes running in your containers; the Docker check monitors those.

Some process metrics can be retrieved only if the Datadog collector runs as the same user as the monitored process or has privileged access. If the former is not desired and you want to avoid running the collector as root, the try_sudo option lets the process check try using sudo to collect these metrics. Currently, only the open_file_descriptors metric on Unix platforms takes advantage of this setting. Note: the appropriate sudoers rules must be configured for this to work:

dd-agent ALL=NOPASSWD: /bin/ls /proc/*/fd/
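
The sudoers rule above permits listing /proc/<pid>/fd/ through sudo. A minimal sketch of how a file descriptor count could be derived from such a listing follows; it assumes one directory entry per open descriptor, all function names are hypothetical, and it is not the Agent's implementation.

```python
import subprocess


def sudo_ls_command(pid):
    """Build the command permitted by the sudoers rule for a given PID."""
    return ["sudo", "ls", "/proc/{}/fd/".format(pid)]


def count_entries(ls_output):
    """Count listed entries; each line of ls output is one open descriptor."""
    return len(ls_output.splitlines())


def open_fds_with_sudo(pid):
    """Run the listing through sudo and return the descriptor count.

    Raises subprocess.CalledProcessError if sudo or ls fails, for example
    when the sudoers rule is missing or the process has exited.
    """
    output = subprocess.check_output(sudo_ls_command(pid))
    return count_entries(output.decode())
```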

See the example configuration for more details on configuration options.

Restart the Agent to start sending process metrics and service checks to Datadog.

Validation

Run the Agent’s status subcommand and look for process under the Checks section.

Data Collected

Metrics

system.processes.cpu.pct
(gauge)
The process CPU utilization.
shown as percent
system.processes.involuntary_ctx_switches
(gauge)
The number of involuntary context switches performed by this process.
shown as event
system.processes.ioread_bytes
(gauge)
The number of bytes read from disk by this process.
shown as byte
system.processes.ioread_count
(gauge)
The number of disk reads by this process.
shown as read
system.processes.iowrite_bytes
(gauge)
The number of bytes written to disk by this process.
shown as byte
system.processes.iowrite_count
(gauge)
The number of disk writes by this process.
shown as write
system.processes.mem.page_faults.minor_faults
(gauge)
The number of minor page faults per second for this process.
shown as occurrence
system.processes.mem.page_faults.children_minor_faults
(gauge)
The number of minor page faults per second for children of this process.
shown as occurrence
system.processes.mem.page_faults.major_faults
(gauge)
The number of major page faults per second for this process.
shown as occurrence
system.processes.mem.page_faults.children_major_faults
(gauge)
The number of major page faults per second for children of this process.
shown as occurrence
system.processes.mem.pct
(gauge)
The process memory consumption.
shown as percent
system.processes.mem.real
(gauge)
The non-swapped physical memory that a process has used and that cannot be shared with another process.
shown as byte
system.processes.mem.rss
(gauge)
The non-swapped physical memory a process has used, also known as the "Resident Set Size".
shown as byte
system.processes.mem.vms
(gauge)
The total amount of virtual memory used by the process, also known as the "Virtual Memory Size".
shown as byte
system.processes.number
(gauge)
The number of processes.
shown as process
system.processes.open_file_descriptors
(gauge)
The number of file descriptors used by this process (only available for processes run as the dd-agent user).
system.processes.open_handles
(gauge)
The number of handles used by this process.
system.processes.threads
(gauge)
The number of threads used by this process.
shown as thread
system.processes.voluntary_ctx_switches
(gauge)
The number of voluntary context switches performed by this process.
shown as event
system.processes.run_time.avg
(gauge)
The average running time of all instances of this process.
shown as second
system.processes.run_time.max
(gauge)
The longest running time of all instances of this process.
shown as second
system.processes.run_time.min
(gauge)
The shortest running time of all instances of this process.
shown as second

All metrics are per instance configured in process.yaml, and are tagged process_name:<instance_name>.

Events

The Process check does not include any events at this time.

Service Checks

process.up:

The Agent submits this service check for each instance in process.yaml, tagging each with process:<name>.

For an instance with no thresholds specified, the service check has a status of either CRITICAL (zero processes running) or OK (at least one process running).

For an instance with thresholds specified, consider this example:

instances:
  - name: my_worker_process
    search_string: ['/usr/local/bin/worker']
    thresholds:
      critical: [1, 7]
      warning: [3, 5]

The Agent submits a process.up tagged process:my_worker_process whose status is:

  • CRITICAL when there are fewer than 1 or more than 7 worker processes
  • WARNING when there are 1, 2, 6, or 7 worker processes
  • OK when there are 3, 4, or 5 worker processes
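
The status rules above can be expressed as a short Python sketch. It reproduces the documented behavior only and is not the Agent's code: the function name process_up_status is hypothetical, and treating a missing warning or critical range as "at least one process, no upper bound" is an assumption made for this example.

```python
OK, WARNING, CRITICAL = "OK", "WARNING", "CRITICAL"


def process_up_status(count, thresholds=None):
    """Return the process.up status for a given number of running processes."""
    if not thresholds:
        # No thresholds: CRITICAL with zero processes, OK with at least one.
        return CRITICAL if count < 1 else OK

    # Assumption: a range that is not configured defaults to [1, infinity].
    default_range = [1, float("inf")]
    warn_low, warn_high = thresholds.get("warning", default_range)
    crit_low, crit_high = thresholds.get("critical", default_range)

    # The critical range is checked first, so it takes priority over warning.
    if count < crit_low or count > crit_high:
        return CRITICAL
    if count < warn_low or count > warn_high:
        return WARNING
    return OK
```

With thresholds of critical: [1, 7] and warning: [3, 5], this sketch returns CRITICAL for 0 or 8 processes, WARNING for 1, 2, 6, or 7, and OK for 3, 4, or 5.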

Troubleshooting

Need help? Contact Datadog Support.

Further Reading

To get a better idea of how (or why) to monitor process resource consumption with Datadog, check out our series of blog posts about it.