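The Process Check lets you collect resource usage metrics (such as CPU, memory, I/O, and thread counts) for specific running processes, and report through the process.up service check whether the expected number of instances is running.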
The Process Check is included in the Datadog Agent package, so you don’t need to install anything else on your server.
Unlike many checks, the Process Check doesn’t monitor anything useful by default. You must configure which processes you want to monitor, and how.
While there’s no standard default check configuration, here’s an example process.d/conf.yaml that monitors SSH/SSHD processes. See the sample process.d/conf.yaml for all available configuration options:
init_config:

instances:

    ## @param name - string - required
    ## Used to uniquely identify your metrics
    ## as they are tagged with this name in Datadog.
    #
  - name: ssh

    ## @param search_string - list of strings - required
    ## If one of the elements in the list matches, it returns the count of
    ## all the processes that match the string exactly by default.
    ## Change this behavior with the parameter `exact_match: false`.
    #
    search_string: ["ssh", "sshd"]
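The matching rule described in the search_string comment can be sketched in Python. This is a minimal illustration, not the check's actual implementation: `ProcessInfo` and `count_matches` are hypothetical names, comparing against the process name in exact mode is an assumption, and treating `exact_match: false` as a substring search over the full command line is also an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class ProcessInfo:
    """Hypothetical stand-in for one running process."""
    name: str            # executable name, e.g. "sshd"
    cmdline: List[str]   # full command line, split into arguments


def count_matches(processes: Iterable[ProcessInfo],
                  search_strings: List[str],
                  exact_match: bool = True) -> int:
    """Count processes matched by at least one search string."""
    count = 0
    for proc in processes:
        if exact_match:
            # Default per the config comment: the string must match exactly.
            # Assumption: it is compared against the process name.
            matched = proc.name in search_strings
        else:
            # Assumption: a substring hit anywhere in the command line counts.
            full_cmd = " ".join(proc.cmdline)
            matched = any(s in full_cmd for s in search_strings)
        if matched:
            count += 1
    return count


procs = [
    ProcessInfo("sshd", ["/usr/sbin/sshd", "-D"]),
    ProcessInfo("ssh", ["ssh", "user@host"]),
    ProcessInfo("ssh-agent", ["/usr/bin/ssh-agent", "-s"]),
]
print(count_matches(procs, ["ssh", "sshd"]))                     # 2
print(count_matches(procs, ["ssh", "sshd"], exact_match=False))  # 3
```

Under these assumptions, exact matching skips ssh-agent, while the looser mode also counts it because its command line contains "ssh".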
To retrieve some process metrics, the Datadog collector must either run as the same user as the monitored process or have privileged access. If you don’t want the former, and want to avoid running the Datadog collector as root, the try_sudo option lets the Process Check try using sudo to collect the metric. Currently, only the open_file_descriptors metric on Unix platforms uses this setting. Note: You must configure the appropriate sudoers rules for this to work:
dd-agent ALL=NOPASSWD: /bin/ls /proc/*/fd/
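To see why the sudoers rule targets `ls` on `/proc/*/fd/`, here is a rough sketch of how a file descriptor count could be derived from that listing. The helper names `count_fd_entries` and `open_fds_via_sudo` are hypothetical, and the exact command the check runs is an assumption based on the rule shown above.

```python
import subprocess


def count_fd_entries(ls_output: str) -> int:
    """Count non-empty lines of `ls` output; each line is one open descriptor."""
    return sum(1 for line in ls_output.splitlines() if line.strip())


def open_fds_via_sudo(pid: int) -> int:
    """Hypothetical helper: list /proc/<pid>/fd/ through sudo and count entries."""
    # -n makes sudo fail instead of prompting when no NOPASSWD rule applies.
    output = subprocess.check_output(
        ["sudo", "-n", "/bin/ls", f"/proc/{pid}/fd/"],
        text=True,
    )
    return count_fd_entries(output)


# Parsing demo on a canned listing, so no sudo access is needed to run it.
print(count_fd_entries("0\n1\n2\n255\n"))  # 4
```

Calling `open_fds_via_sudo` only works on a Unix host where the calling user has a matching sudoers entry; without one, `subprocess.CalledProcessError` is raised.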
Run the Agent’s status subcommand and look for process under the Checks section.
Note: Some metrics are not available on every platform:
Process I/O metrics are not available on Linux or OSX, because the files that the Agent reads (/proc//io) are only readable by the process’s owner. For more information, read the Agent FAQ.
system.cpu.iowait is not available on Windows.
All metrics are per instance configured in process.d/conf.yaml, and are tagged process_name:<instance_name>.
The system.processes.cpu.pct metric sent by this check is only accurate for processes that live for more than 30 seconds. Do not expect its value to be accurate for shorter-lived processes.
For the full list of metrics, see the Metrics section.
system.processes.cpu.pct (gauge) | The CPU utilization of a process. Shown as percent |
system.processes.cpu.normalized_pct (gauge) | The normalized CPU utilization of a process. Shown as percent |
system.processes.involuntary_ctx_switches (gauge) | The number of involuntary context switches performed by this process. Shown as event |
system.processes.ioread_bytes (gauge) | The number of bytes read from disk by this process. In Windows: the number of bytes read. Shown as byte |
system.processes.ioread_bytes_count (count) | The number of bytes read from disk by this process. In Windows: the number of bytes read. Shown as byte |
system.processes.ioread_count (gauge) | The number of disk reads by this process. In Windows: the number of reads by this process. Shown as read |
system.processes.iowrite_bytes (gauge) | The number of bytes written to disk by this process. In Windows: the number of bytes written by this process. Shown as byte |
system.processes.iowrite_bytes_count (count) | The number of bytes written to disk by this process. In Windows: the number of bytes written by this process. Shown as byte |
system.processes.iowrite_count (gauge) | The number of disk writes by this process. In Windows: the number of writes by this process. Shown as write |
system.processes.mem.page_faults.minor_faults (gauge) | The number of minor page faults per second for this process. Shown as occurrence |
system.processes.mem.page_faults.children_minor_faults (gauge) | The number of minor page faults per second for children of this process. Shown as occurrence |
system.processes.mem.page_faults.major_faults (gauge) | The number of major page faults per second for this process. Shown as occurrence |
system.processes.mem.page_faults.children_major_faults (gauge) | The number of major page faults per second for children of this process. Shown as occurrence |
system.processes.mem.pct (gauge) | The process memory consumption. Shown as percent |
system.processes.mem.real (gauge) | The non-swapped physical memory a process has used and cannot be shared with another process (Linux only). Shown as byte |
system.processes.mem.rss (gauge) | The non-swapped physical memory a process has used, also known as the "Resident Set Size". Shown as byte |
system.processes.mem.vms (gauge) | The total amount of virtual memory used by the process, also known as the "Virtual Memory Size". Shown as byte |
system.processes.number (gauge) | The number of processes. Shown as process |
system.processes.open_file_descriptors (gauge) | The number of file descriptors used by this process (only available for processes run as the dd-agent user) |
system.processes.open_handles (gauge) | The number of handles used by this process. |
system.processes.threads (gauge) | The number of threads used by this process. Shown as thread |
system.processes.voluntary_ctx_switches (gauge) | The number of voluntary context switches performed by this process. Shown as event |
system.processes.run_time.avg (gauge) | The average running time of all instances of this process. Shown as second |
system.processes.run_time.max (gauge) | The longest running time of all instances of this process. Shown as second |
system.processes.run_time.min (gauge) | The shortest running time of all instances of this process. Shown as second |
The Process Check does not include any events.
process.up: The Agent submits this service check for each instance in process.d/conf.yaml, tagging each with process:<name>.
For an instance with no thresholds specified, the service check has a status of either CRITICAL (zero processes running) or OK (at least one process running).
For an instance with thresholds specified, consider this example:
instances:
  - name: my_worker_process
    search_string: ["/usr/local/bin/worker"]
    thresholds:
      critical: [1, 7]
      warning: [3, 5]
The Agent submits a process.up service check tagged process:my_worker_process whose status is:
CRITICAL when there are fewer than 1 or more than 7 worker processes
WARNING when there are 1, 2, 6, or 7 worker processes
OK when there are 3, 4, or 5 worker processes
Need help? Contact Datadog support.
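The process.up status rules above can be read as two inclusive ranges: a count outside the critical range is CRITICAL, a count inside it but outside the warning range is WARNING, and anything else is OK. The following is a minimal sketch of that reading; `process_up_status` is a hypothetical function written for illustration, not the Agent's own code.

```python
from typing import Optional, Sequence


def process_up_status(count: int,
                      critical: Optional[Sequence[int]] = None,
                      warning: Optional[Sequence[int]] = None) -> str:
    """Map a process count to a status string using inclusive [low, high] ranges."""
    if critical is None and warning is None:
        # No thresholds: OK if at least one process is running.
        return "OK" if count > 0 else "CRITICAL"
    if critical is not None and not (critical[0] <= count <= critical[1]):
        return "CRITICAL"
    if warning is not None and not (warning[0] <= count <= warning[1]):
        return "WARNING"
    return "OK"


# Thresholds from the my_worker_process example.
for n in range(0, 9):
    print(n, process_up_status(n, critical=[1, 7], warning=[3, 5]))
```

With the example thresholds, counts of 0 and 8 map to CRITICAL, counts of 1, 2, 6, and 7 map to WARNING, and counts of 3 through 5 map to OK, which agrees with the list above.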
To get a better idea of how (or why) to monitor process resource consumption with Datadog, check out this series of blog posts about it.