Overview

Configure your HTTP Client so that the Observability Pipelines Worker formats the logs collected into a Datadog-rehydratable format before routing them to Datadog Log Archives.

The log sources, processors, and destinations available for this use case

This document walks you through the following steps:

  1. The prerequisites needed to set up Observability Pipelines
  2. Configuring a Log Archive
  3. Setting up Observability Pipelines
  4. Sending logs to the Observability Pipelines Worker

Prerequisites

To use Observability Pipelines’s HTTP/S Client source, you need the following information available:

  1. The full path of the HTTP Server endpoint that the Observability Pipelines Worker collects log events from. For example, https://127.0.0.8/logs.
  2. The HTTP authentication token or password.

The HTTP/S Client source pulls data from your upstream HTTP server. Your HTTP server must support GET requests for the HTTP Client endpoint URL that you set as an environment variable when you install the Worker.

Configure Log Archives

If you already have a Datadog Log Archive configured for Observability Pipelines, skip to Set up Observability Pipelines.

You need to have the Datadog integration for your cloud provider installed to set up Datadog Log Archive. See AWS integration, Google Cloud Platform, and Azure integration documentation for more information.

Select the cloud provider you are using to archive your logs.

Create an Amazon S3 bucket

  1. Navigate to Amazon S3 buckets.
  2. Click Create bucket.
  3. Enter a descriptive name for your bucket.
  4. Do not make your bucket publicly readable.
  5. Optionally, add tags.
  6. Click Create bucket.

Set up an IAM policy that allows Workers to write to the S3 bucket

  1. Navigate to the IAM console.
  2. Select Policies in the left side menu.
  3. Click Create policy.
  4. Click JSON in the Specify permissions section.
  5. Copy the below policy and paste it into the Policy editor. Replace <MY_BUCKET_NAME> and <MY_BUCKET_NAME_1_/_MY_OPTIONAL_BUCKET_PATH_1> with the information for the S3 bucket you created earlier.
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DatadogUploadAndRehydrateLogArchives",
                "Effect": "Allow",
                "Action": ["s3:PutObject", "s3:GetObject"],
                "Resource": "arn:aws:s3:::<MY_BUCKET_NAME_1_/_MY_OPTIONAL_BUCKET_PATH_1>/*"
            },
            {
                "Sid": "DatadogRehydrateLogArchivesListBucket",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": "arn:aws:s3:::<MY_BUCKET_NAME>"
            }
        ]
    }
    
  6. Click Next.
  7. Enter a descriptive policy name.
  8. Optionally, add tags.
  9. Click Create policy.

Create an IAM user

Create an IAM user and attach the IAM policy you created earlier to it.

  1. Navigate to the IAM console.
  2. Select Users in the left side menu.
  3. Click Create user.
  4. Enter a username.
  5. Click Next.
  6. Select Attach policies directly.
  7. Choose the IAM policy you created earlier to attach to the new IAM user.
  8. Click Next.
  9. Optionally, add tags.
  10. Click Create user.

Create access credentials for the new IAM user. The AWS access key and AWS secret access key are added as environment variables in the Install the Observability Pipelines Worker step.

Create a service account

Create a service account to use the policy you created above.

Create an IAM user

Create an IAM user and attach the IAM policy you created earlier to it.

  1. Navigate to the IAM console.
  2. Select Users in the left side menu.
  3. Click Create user.
  4. Enter a username.
  5. Click Next.
  6. Select Attach policies directly.
  7. Choose the IAM policy you created earlier to attach to the new IAM user.
  8. Click Next.
  9. Optionally, add tags.
  10. Click Create user.

Create access credentials for the new IAM user. The AWS access key and AWS secret access key are added later as environment variables when you install the Observability Pipelines Worker.

Create an IAM user

Create an IAM user and attach the IAM policy you created earlier to it.

  1. Navigate to the IAM console.
  2. Select Users in the left side menu.
  3. Click Create user.
  4. Enter a username.
  5. Click Next.
  6. Select Attach policies directly.
  7. Choose the IAM policy you created earlier to attach to the new IAM user.
  8. Click Next.
  9. Optionally, add tags.
  10. Click Create user.

Create access credentials for the new IAM user. The AWS access key and AWS secret access key are added as environment variables in the Install the Observability Pipelines Worker step.

Connect the S3 bucket to Datadog Log Archives

  1. Navigate to Datadog Log Forwarding.
  2. Click New archive.
  3. Enter a descriptive archive name.
  4. Add a query that filters out all logs going through log pipelines so that none of those logs go into this archive. For example, add the query observability_pipelines_read_only_archive, assuming no logs going through the pipeline have that tag added.
  5. Select AWS S3.
  6. Select the AWS account that your bucket is in.
  7. Enter the name of the S3 bucket.
  8. Optionally, enter a path.
  9. Check the confirmation statement.
  10. Optionally, add tags and define the maximum scan size for rehydration. See Advanced settings for more information.
  11. Click Save.

See the Log Archives documentation for additional information.

Create a storage bucket

  1. Navigate to Google Cloud Storage.
  2. On the Buckets page, click Create to create a bucket for your archives..
  3. Enter a name for the bucket and choose where to store your data.
  4. Select Fine-grained in the Choose how to control access to objects section.
  5. Do not add a retention policy because the most recent data needs to be rewritten in some rare cases (typically a timeout case).
  6. Click Create.

Allow the Observability Pipeline Worker to write to the bucket

To authenticate the Observability Pipelines Worker for Google Cloud Storage, contact your Google Security Operations representative for a Google Developer Service Account Credential. This credential is a JSON file and must be placed under DD_OP_DATA_DIR/config. See Getting API authentication credential for more information.

Note: If you are installing the Worker in Kubernetes, see Referencing files in Kubernetes for information on how to reference the credentials file.

Connect the storage bucket to Datadog Log Archives

  1. Navigate to Datadog Log Forwarding.
  2. Click New archive.
  3. Enter a descriptive archive name.
  4. Add a query that filters out all logs going through log pipelines so that none of those logs go into this archive. For example, add the query observability_pipelines_read_only_archive, assuming no logs going through the pipeline have that tag added.
  5. Select Google Cloud Storage.
  6. Select the service account your storage bucket is in.
  7. Select the project.
  8. Enter the name of the storage bucket you created earlier.
  9. Optionally, enter a path.
  10. Optionally, set permissions, add tags, and define the maximum scan size for rehydration. See Advanced settings for more information.
  11. Click Save.

See the Log Archives documentation for additional information.

Create a storage account

Create an Azure storage account if you don’t already have one.

  1. Navigate to Storage accounts.
  2. Click Create.
  3. Select the subscription name and resource name you want to use.
  4. Enter a name for your storage account.
  5. Select a region in the dropdown menu.
  6. Select Standard performance or Premium account type.
  7. Click Next.
  8. In the Blob storage section, select Hot or Cool storage.
  9. Click Review + create.

Create a storage bucket

  1. In your storage account, click Containers under Data storage in the left navigation menu.
  2. Click + Container at the top to create a new container.
  3. Enter a name for the new container. This name is used later when you set up the Observability Pipelines Azure Storage destination.

Note: Do not set immutability policies because the most recent data might need to be rewritten in rare cases (typically when there is a timeout).

Connect the Azure container to Datadog Log Archives

  1. Navigate to Datadog Log Forwarding.
  2. Click New archive.
  3. Enter a descriptive archive name.
  4. Add a query that filters out all logs going through log pipelines so that none of those logs go into this archive. For example, add the query observability_pipelines_read_only_archive, assuming no logs going through the pipeline have that tag added.
  5. Select Azure Storage.
  6. Select the Azure tenant and client your storage account is in.
  7. Enter the name of the storage account.
  8. Enter the name of the container you created earlier.
  9. Optionally, enter a path.
  10. Optionally, set permissions, add tags, and define the maximum scan size for rehydration. See Advanced settings for more information.
  11. Click Save.

See the Log Archives documentation for additional information.

Set up Observability Pipelines

  1. Navigate to Observability Pipelines.
  2. Select the Archive Logs template to create a new pipeline.
  3. Select HTTP Client as the source.

Set up the source

To configure your HTTP/S Client source:

  1. Select your authorization strategy.
  2. Select the decoder you want to use on the HTTP messages. Logs pulled from the HTTP source must be in this format.
  3. Optionally, toggle the switch to enable TLS. If you enable TLS, the following certificate and key files are required:
    • Server Certificate Path: The path to the certificate file that has been signed by your Certificate Authority (CA) Root File in DER or PEM (X.509) format.
    • CA Certificate Path: The path to the certificate file that is your Certificate Authority (CA) Root File in DER or PEM (X.509) format.
    • Private Key Path: The path to the .key private key file that belongs to your Server Certificate Path in DER or PEM (PKCS#8) format.
  4. Enter the interval between scrapes.
    • Your HTTP Server must be able to handle GET requests at this interval.
    • Since requests run concurrently, if a scrape takes longer than the interval given, a new scrape is started, which can consume extra resources. Set the timeout to a value lower than the scrape interval to prevent this from happening.
  5. Enter the timeout for each scrape request.

Set up the destinations

Enter the following information based on your selected logs destination.

If the Worker is ingesting logs that are not coming from the Datadog Agent and are shipped to an archive using the Observability Pipelines Datadog Archives destination, those logs are not tagged with reserved attributes. In addition, logs rehydrated into Datadog will not have standard attributes mapped. This means that when you rehydrate your logs into Log Management, you may lose Datadog telemetry, the ability to search logs easily, and the benefits of unified service tagging if you do not structure and remap your logs in Observability Pipelines before routing your logs to an archive.

For example, say your syslogs are sent to Datadog Archives and those logs have the status tagged as severity instead of the reserved attribute of status and the host tagged as host-name instead of the reserved attribute hostname. When these logs are rehydrated in Datadog, the status for each log is set to info and none of the logs have a hostname tag.

Follow the instructions for the cloud provider you are using to archive your logs.

  1. Enter the S3 bucket name for the S3 bucket you created earlier.
  2. Enter the AWS region the S3 bucket is in.
  3. Enter the key prefix. Prefixes are useful for partitioning objects. For example, you can use a prefix as an object key to store objects under a particular directory. If using a prefix for this purpose, it must end in / to act as a directory path; a trailing / is not automatically added.
  4. Select the storage class for your S3 bucket in the Storage Class dropdown menu.

Your AWS access key ID and AWS secret access key are set as environment variables when you install the Worker later.

  1. Enter the name of the Google Cloud storage bucket you created earlier.
  2. Enter the path to the credentials JSON file you downloaded earlier.
  3. Select the storage class for the created objects.
  4. Select the access level of the created objects.
  5. Optionally, enter in the prefix. Prefixes are useful for partitioning objects. For example, you can use a prefix as an object key to store objects under a particular directory. If using a prefix for this purpose, it must end in / to act as a directory path; a trailing / is not automatically added.
  6. Optionally, click Add Header to add metadata.
  1. Enter the name of the Azure container you created earlier.
  2. Optionally, enter a prefix. Prefixes are useful for partitioning objects. For example, you can use a prefix as an object key to store objects under a particular directory. If using a prefix for this purpose, it must end in / to act as a directory path; a trailing / is not automatically added.

There are no configuration steps for your Datadog destination.

The following fields are optional:

  1. Enter the name of the Splunk index you want your data in. This has to be an allowed index for your HEC.
  2. Select whether the timestamp should be auto-extracted. If set to true, Splunk extracts the timestamp from the message with the expected format of yyyy-mm-dd hh:mm:ss.
  3. Set the sourcetype to override Splunk’s default value, which is httpevent for HEC data.

The following fields are optional:

  1. In the Encoding dropdown menu, select whether you want to encode your pipeline’s output in JSON, Logfmt, or Raw text. If no decoding is selected, the decoding defaults to JSON.
  2. Enter a source name to override the default name value configured for your Sumo Logic collector’s source.
  3. Enter a host name to override the default host value configured for your Sumo Logic collector’s source.
  4. Enter a category name to override the default category value configured for your Sumo Logic collector’s source.
  5. Click Add Header to add any custom header fields and values.
The rsyslog and syslog-ng destinations support the RFC5424 format.

The rsyslog and syslog-ng destinations match these log fields to the following Syslog fields:

Log EventSYSLOG FIELDDefault
log[“message”]MESSAGENIL
log[“procid”]PROCIDThe running Worker’s process ID.
log[“appname”]APP-NAMEobservability_pipelines
log[“facility”]FACILITY8 (log_user)
log[“msgid”]MSGIDNIL
log[“severity”]SEVERITYinfo
log[“host”]HOSTNAMENIL
log[“timestamp”]TIMESTAMPCurrent UTC time.

The following destination settings are optional:

  1. Toggle the switch to enable TLS. If you enable TLS, the following certificate and key files are required:
    • Server Certificate Path: The path to the certificate file that has been signed by your Certificate Authority (CA) Root File in DER or PEM (X.509).
    • CA Certificate Path: The path to the certificate file that is your Certificate Authority (CA) Root File in DER or PEM (X.509).
    • Private Key Path: The path to the .key private key file that belongs to your Server Certificate Path in DER or PEM (PKCS#8) format.
  2. Enter the number of seconds to wait before sending TCP keepalive probes on an idle connection.

To authenticate the Observability Pipelines Worker for Google Chronicle, contact your Google Security Operations representative for a Google Developer Service Account Credential. This credential is a JSON file and must be placed under DD_OP_DATA_DIR/config. See Getting API authentication credential for more information.

Note: If you are installing the Worker in Kubernetes, see Referencing files in Kubernetes for information on how to reference the credentials file.

To set up the Worker’s Google Chronicle destination:

  1. Enter the customer ID for your Google Chronicle instance.
  2. Enter the path to the credentials JSON file you downloaded earlier.
  3. Select JSON or Raw encoding in the dropdown menu.
  4. Select the appropriate Log Type in the dropdown menu.

Note: Logs sent to the Google Chronicle destination must have ingestion labels. For example, if the logs are from a A10 load balancer, it must have the ingestion label A10_LOAD_BALANCER. See Google Cloud’s Support log types with a default parser for a list of available log types and their respective ingestion labels.

The following fields are optional:

  1. Enter the name for the Elasticsearch index.
  2. Enter the Elasticsearch version.

Optionally, enter the name of the OpenSearch index.

  1. Optionally, enter the name of the Amazon OpenSearch index.
  2. Select an authentication strategy, Basic or AWS. For AWS, enter the AWS region.

Set up processors

There are pre-selected processors added to your processor group out of the box. You can add additional processors or delete any existing ones based on your processing needs.

Processor groups are executed from top to bottom. The order of the processors is important because logs are checked by each processor, but only logs that match the processor’s filters are processed. To modify the order of the processors, use the drag handle on the top left corner of the processor you want to move.

Filter query syntax

Each processor has a corresponding filter query in their fields. Processors only process logs that match their filter query. And for all processors except the filter processor, logs that do not match the query are sent to the next step of the pipeline. For the filter processor, logs that do not match the query are dropped.

For any attribute, tag, or key:value pair that is not a reserved attribute, your query must start with @. Conversely, to filter reserved attributes, you do not need to append @ in front of your filter query.

For example, to filter out and drop status:info logs, your filter can be set as NOT (status:info). To filter out and drop system-status:info, your filter must be set as NOT (@system-status:info).

Filter query examples:

  • NOT (status:debug): This filters for only logs that do not have the status DEBUG.
  • status:ok service:flask-web-app: This filters for all logs with the status OK from your flask-web-app service.
    • This query can also be written as: status:ok AND service:flask-web-app.
  • host:COMP-A9JNGYK OR host:COMP-J58KAS: This filter query only matches logs from the labeled hosts.
  • @user.status:inactive: This filters for logs with the status inactive nested under the user attribute.

Learn more about writing filter queries in Datadog’s Log Search Syntax.

Add processors

Enter the information for the processors you want to use. Click the Add button to add additional processors. To delete a processor, click the kebab on the right side of the processor and select Delete.

The log processors available

This processor filters for logs that match the specified filter query and drops all non-matching logs. If a log is dropped at this processor, then none of the processors below this one receives that log. This processor can filter out unnecessary logs, such as debug or warning logs.

To set up the filter processor:

  • Define a filter query. The query you specify filters for and passes on only logs that match it, dropping all other logs.

The remap processor can add, drop, or rename fields within your individual log data. Use this processor to enrich your logs with additional context, remove low-value fields to reduce volume, and standardize naming across important attributes. Select add field, drop field, or rename field in the dropdown menu to get started.

Add field

Use add field to append a new key-value field to your log.

To set up the add field processor:

  1. Define a filter query. Only logs that match the specified filter query are processed. All logs, regardless of whether they do or do not match the filter query, are sent to the next step in the pipeline.
  2. Enter the field and value you want to add. To specify a nested field for your key, use the path notation: <OUTER_FIELD>.<INNER_FIELD>. All values are stored as strings. Note: If the field you want to add already exists, the Worker throws an error and the existing field remains unchanged.
Drop field

Use drop field to drop a field from logging data that matches the filter you specify below.

To set up the drop field processor:

  1. Define a filter query. Only logs that match the specified filter query are processed. All logs, regardless of whether they do or do not match the filter query, are sent to the next step in the pipeline.
  2. Enter the key of the field you want to drop. To specify a nested field for your specified key, use the path notation: <OUTER_FIELD>.<INNER_FIELD>. Note: If your specified key does not exist, your log will be unimpacted.
Rename field

Use rename field to rename a field within your log.

To set up the rename field processor:

  1. Define a filter query. Only logs that match the specified filter query are processed. All logs, regardless of whether they do or do not match the filter query, are sent to the next step in the pipeline.
  2. Enter the name of the field you want to rename in the Source field. To specify a nested field for your key, use the path notation: <OUTER_FIELD>.<INNER_FIELD>. Once renamed, your original field is deleted unless you enable the Preserve source tag checkbox described below.
    Note: If the source key you specify doesn’t exist, a default null value is applied to your target.
  3. In the Target field, enter the name you want the source field to be renamed to. To specify a nested field for your specified key, use the path notation: <OUTER_FIELD>.<INNER_FIELD>.
    Note: If the target field you specify already exists, the Worker throws an error and does not overwrite the existing target field.
  4. Optionally, check the Preserve source tag box if you want to retain the original source field and duplicate the information from your source key to your specified target key. If this box is not checked, the source key is dropped after it is renamed.
Path notation example

For the following message structure, use outer_key.inner_key.double_inner_key to refer to the key with the value double_inner_value.

{
    "outer_key": {
        "inner_key": "inner_value",
            "a": {
                    "double_inner_key": "double_inner_value",
                    "b": "b value"
                },
            "c": "c value"
        },
        "d": "d value"
    }

This processor samples your logging traffic for a representative subset at the rate that you define, dropping the remaining logs. As an example, you can use this processor to sample 20% of logs from a noisy non-critical service.

The sampling only applies to logs that match your filter query and does not impact other logs. If a log is dropped at this processor, none of the processors below receives that log.

To set up the sample processor:

  1. Define a filter query. Only logs that match the specified filter query are sampled at the specified retention rate below. The sampled logs and the logs that do not match the filter query are sent to the next step in the pipeline.
  2. Set the retain field with your desired sampling rate expressed as a percentage. For example, entering 2 means 2% of logs are retained out of all the logs that match the filter query.

This processor parses logs using the grok parsing rules that are available for a set of sources. The rules are automatically applied to logs based on the log source. Therefore, logs must have a source field with the source name. If this field is not added when the log is sent to the Observability Pipelines Worker, you can use the Add field processor to add it.

If the source field of a log matches one of the grok parsing rule sets, the log’s message field is checked against those rules. If a rule matches, the resulting parsed data is added in the message field as a JSON object, overwriting the original message.

If there isn’t a source field on the log, or no rule matches the log message, then no changes are made to the log and it is sent to the next step in the pipeline.

To set up the grok parser:

  1. Define a filter query. Only logs that match the specified filter query are processed. All logs, regardless of whether they do or do not match the filter query, are sent to the next step in the pipeline.
  2. Click the Preview Rules button.
  3. Search or select a source in the dropdown menu to see the grok parsing rules for that source.

The quota processor measures the logging traffic for logs that match the filter you specify. When the configured daily quota is met inside the 24-hour rolling window, the processor can either drop additional logs or send an alert using a Datadog monitor. You can configure the processor to track the total volume or the total number of events.

As an example, you can configure this processor to drop new logs or trigger an alert without dropping logs after the processor has received 10 million events from a certain service in the last 24 hours.

To set up the quota processor:

  1. Enter a name for the quota processor. The pipeline uses the name to identify the quota across multiple Remote Configuration deployments of the Worker.
  2. Define a filter query. Only logs that match the specified filter query are counted towards the daily limit.
    • Logs that match the quota filter and are within the daily quota are sent to the next step in the pipeline.
    • Logs that do not match the quota filter are sent to the next step of the pipeline.
  3. In the Unit for quota dropdown menu, select if you want to measure the quota by the number of Events or by the Volume in bytes.
  4. Set the daily quota limit and select the unit of magnitude for your desired quota.
  5. Check the Drop events checkbox if you want to drop all events when your quota is met. Leave it unchecked if you plan to set up a monitor that sends an alert when the quota is met.
    • If logs that match the quota filter are received after the daily quota has been met and the Drop events option is selected, then those logs are dropped. In this case, only logs that did not match the filter query are sent to the next step in the pipeline.
    • If logs that match the quota filter are received after the daily quota has been met and the Drop events option is not selected, then those logs and the logs that did not match the filter query are sent to the next step in the pipeline.

The reduce processor groups multiple log events into a single log, based on the fields specified and the merge strategies selected. Logs are grouped at 10-second intervals. After the interval has elapsed for the group, the reduced log for that group is sent to the next step in the pipeline.

To set up the reduce processor:

  1. Define a filter query. Only logs that match the specified filter query are processed. Reduced logs and logs that do not match the filter query are sent to the next step in the pipeline.
  2. In the Group By section, enter the field you want to group the logs by.
  3. Click Add Group by Field to add additional fields.
  4. In the Merge Strategy section:
    • In On Field, enter the name of the field you want to merge the logs on.
    • Select the merge strategy in the Apply dropdown menu. This is the strategy used to combine events. See the following Merge strategies section for descriptions of the available strategies.
    • Click Add Merge Strategy to add additional strategies.
Merge strategies

These are the available merge strategies for combining log events.

NameDescription
ArrayAppends each value to an array.
ConcatConcatenates each string value, delimited with a space.
Concat newlineConcatenates each string value, delimited with a newline.
Concat rawConcatenates each string value, without a delimiter.
DiscardDiscards all values except the first value that was received.
Flat uniqueCreates a flattened array of all unique values that were received.
Longest arrayKeeps the longest array that was received.
MaxKeeps the maximum numeric value that was received.
MinKeeps the minimum numeric value that was received.
RetainDiscards all values except the last value that was received. Works as a way to coalesce by not retaining `null`.
Shortest arrayKeeps the shortest array that was received.
SumSums all numeric values that were received.

The deduplicate processor removes copies of data to reduce volume and noise. It caches 5,000 messages at a time and compares your incoming logs traffic against the cached messages. For example, this processor can be used to keep only unique warning logs in the case where multiple identical warning logs are sent in succession.

To set up the deduplicate processor:

  1. Define a filter query. Only logs that match the specified filter query are processed. Deduped logs and logs that do not match the filter query are sent to the next step in the pipeline.
  2. In the Type of deduplication dropdown menu, select whether you want to Match on or Ignore the fields specified below.
    • If Match is selected, then after a log passes through, future logs that have the same values for all of the fields you specify below are removed.
    • If Ignore is selected, then after a log passes through, future logs that have the same values for all of their fields, except the ones you specify below, are removed.
  3. Enter the fields you want to match on, or ignore. At least one field is required, and you can specify a maximum of three fields.
    • Use the path notation <OUTER_FIELD>.<INNER_FIELD> to match subfields. See the Path notation example below.
  4. Click Add field to add additional fields you want to filter on.
Path notation example

For the following message structure, use outer_key.inner_key.double_inner_key to refer to the key with the value double_inner_value.

{
    "outer_key": {
        "inner_key": "inner_value",
            "a": {
                    "double_inner_key": "double_inner_value",
                    "b": "b value"
                },
            "c": "c value"
        },
        "d": "d value"
    }

The Sensitive Data Scanner processor scans logs to detect and redact or hash sensitive information such as PII, PCI, and custom sensitive data. You can pick from our library of predefined rules, or input custom Regex rules to scan for sensitive data.

To set up the sensitive data scanner processor:

  1. Define a filter query. Only logs that match the specified filter query are scanned and processed. All logs, regardless of whether they do or do not match the filter query, are sent to the next step in the pipeline.
  2. Click Add Scanning Rule.
  3. Name your scanning rule.
  4. In the Select scanning rule type field, select whether you want to create a rule from the library or create a custom rule.
    • If you are creating a rule from the library, select the library pattern you want to use.
    • If you are creating a custom rule, enter the regex pattern to check against the data.
  5. In the Scan entire or part of event section, select if you want to scan the Entire Event, Specific Attributes, or Exclude Attributes in the dropdown menu.
    • If you selected Specific Attributes, click Add Field and enter the specific attributes you want to scan. You can add up to three fields. Use path notation (outer_key.inner_key) to access nested keys. For specified attributes with nested data, all nested data is scanned.
    • If you selected Exclude Attributes, click Add Field and enter the specific attributes you want to exclude from scanning. You can add up to three fields. Use path notation (outer_key.inner_key) to access nested keys. For specified attributes with nested data, all nested data is excluded.
  6. In the Define action on match section, select the action you want to take for the matched information. Redaction, partial redaction, and hashing are all irreversible actions.
    • If you are redacting the information, specify the text to replace the matched data.
    • If you are partially redacting the information, specify the number of characters you want to redact and whether to apply the partial redaction to the start or the end of your matched data.
    • Note: If you select hashing, the UTF-8 bytes of the match are hashed with the 64-bit fingerprint of FarmHash.
  7. Optionally, add tags to all events that match the regex, so that you can filter, analyze, and alert on the events.

This processor adds a field with the name of the host that sent the log. For example, hostname: 613e197f3526. Note: If the hostname already exists, the Worker throws an error and does not overwrite the existing hostname.

To set up this processor:

  • Define a filter query. Only logs that match the specified filter query are processed. All logs, regardless of whether they do or do not match the filter query, are sent to the next step in the pipeline.

This processor converts the specified field into JSON objects.

To set up this processor:

  1. Define a filter query. Only logs that match the specified filter query are processed. All logs, regardless of whether they do or do not match the filter query, are sent to the next step in the pipeline.
  2. Enter the name of the field you want to parse JSON on.
    Note: The parsed JSON overwrites what was originally contained in the field.

Use this processor to enrich your logs with information from a reference table, which could be a local file or database.

To set up the enrichment table processor:

  1. Define a filter query. Only logs that match the specified filter query are processed. All logs, regardless of whether they do or do not match the filter query, are sent to the next step in the pipeline.
  2. Enter the source attribute of the log. The source attribute’s value is what you want to find in the reference table.
  3. Enter the target attribute. The target attribute’s value stores, as a JSON object, the information found in the reference table.
  4. Select the type of reference table you want to use, File or GeoIP.
    • For the File type:
      1. Enter the file path.
      2. Enter the column name. The column name in the enrichment table is used for matching the source attribute value. See the Enrichment file example.
        Note: If you are installing the Worker in Kubernetes, see Referencing files in Kubernetes for information on how to reference the file.
    • For the GeoIP type, enter the GeoIP path.
Enrichment file example

For this example, merchant_id is used as the source attribute and merchant_info as the target attribute.

This is the example reference table that the enrichment processor uses:

merch_idmerchant_namecitystate
803Andy’s OttomansBoiseIdaho
536Cindy’s CouchesBoulderColorado
235Debra’s BenchesLas VegasNevada

merch_id is set as the column name the processor uses to find the source attribute’s value. Note: The source attribute’s value does not have to match the column name.

If the enrichment processor receives a log with "merchant_id":"536":

  • The processor looks for the value 536 in the reference table’s merch_id column.
  • After it finds the value, it adds the entire row of information from the reference table to the merchant_info attribute as a JSON object:
merchant_info {
    "merchant_name":"Cindy's Couches",
    "city":"Boulder",
    "state":"Colorado"
}

Install the Observability Pipelines Worker

  1. Select your platform in the Choose your installation platform dropdown menu.

  2. Enter the full path of the HTTP/S endpoint URL. For example, https://127.0.0.8/logs. The Observability Pipelines Worker collects logs events from this endpoint.

  3. Provide the environment variables for each of your selected destinations. See prerequisites for more information.

    Enter the AWS access key ID and AWS secret access key for the S3 archive bucket you created earlier.

    There are no environment variables to configure.

    Enter the Azure connection string you created earlier. The connection string gives the Worker access to your Azure Storage bucket.

    To get the connection string:

    1. Navigate to Azure Storage accounts.
    2. Click Access keys under Security and networking in the left navigation menu.
    3. Copy the connection string for the storage account and paste it into the Azure connection string field on the Observability Pipelines Worker installation page.

    There are no environment variables to configure for Datadog Log Management.

    Enter your Splunk HEC token and the base URL of the Splunk instance. See prerequisites for more information.

    The Worker passes the HEC token to the Splunk collection endpoint. After the Observability Pipelines Worker processes the logs, it sends the logs to the specified Splunk instance URL.

    Note: The Splunk HEC destination forwards all logs to the /services/collector/event endpoint regardless of whether you configure your Splunk HEC destination to encode your output in JSON or raw.

    Enter the Sumo Logic HTTP collector URL. See prerequisites for more information.

    Enter the rsyslog or syslog-ng endpoint URL. For example, 127.0.0.1:9997. The Observability Pipelines Worker sends logs to this address and port.

    Enter the Google Chronicle endpoint URL. For example, https://chronicle.googleapis.com.

    1. Enter the Elasticsearch authentication username.
    2. Enter the Elasticsearch authentication password.
    3. Enter the Elasticsearch endpoint URL. For example, http://CLUSTER_ID.LOCAL_HOST_IP.ip.es.io:9200.
    1. Enter the OpenSearch authentication username.
    2. Enter the OpenSearch authentication password.
    3. Enter the OpenSearch endpoint URL. For example, http://<hostname.IP>:9200.
    1. Enter the Amazon OpenSearch authentication username.
    2. Enter the Amazon OpenSearch authentication password.
    3. Enter the Amazon OpenSearch endpoint URL. For example, http://<hostname.IP>:9200.

  4. Follow the instructions for your environment to install the Worker.

    1. Click Select API key to choose the Datadog API key you want to use.
    2. Run the command provided in the UI to install the Worker. The command is automatically populated with the environment variables you entered earlier.
      docker run -i -e DD_API_KEY=<DATADOG_API_KEY> \
          -e DD_OP_PIPELINE_ID=<PIPELINE_ID> \
          -e DD_SITE=<DATADOG_SITE> \
          -e <SOURCE_ENV_VARIABLE> \
          -e <DESTINATION_ENV_VARIABLE> \
          -p 8088:8088 \
          datadog/observability-pipelines-worker run
      
      Note: By default, the docker run command exposes the same port the Worker is listening on. If you want to map the Worker’s container port to a different port on the Docker host, use the -p | --publish option in the command:
      -p 8282:8088 datadog/observability-pipelines-worker run
      
    3. Navigate back to the Observability Pipelines installation page and click Deploy.

    See Update Existing Pipelines if you want to make changes to your pipeline’s configuration.

    1. Download the Helm chart values file for Amazon EKS.
    2. Click Select API key to choose the Datadog API key you want to use.
    3. Add the Datadog chart repository to Helm:
      helm repo add datadog https://helm.datadoghq.com
      
      If you already have the Datadog chart repository, run the following command to make sure it is up to date:
      helm repo update
      
    4. Run the command provided in the UI to install the Worker. The command is automatically populated with the environment variables you entered earlier.
      helm upgrade --install opw \
      -f aws_eks.yaml \
      --set datadog.apiKey=<DATADOG_API_KEY> \
      --set datadog.pipelineId=<PIPELINE_ID> \
      --set <SOURCE_ENV_VARIABLES> \
      --set <DESTINATION_ENV_VARIABLES> \
      --set service.ports[0].protocol=TCP,service.ports[0].port=<SERVICE_PORT>,service.ports[0].targetPort=<TARGET_PORT> \
      datadog/observability-pipelines-worker
      
      Note: By default, the Kubernetes Service maps incoming port <SERVICE_PORT> to the port the Worker is listening on (<TARGET_PORT>). If you want to map the Worker’s pod port to a different incoming port of the Kubernetes Service, use the following service.ports[0].port and service.ports[0].targetPort values in the command:
      --set service.ports[0].protocol=TCP,service.ports[0].port=8088,service.ports[0].targetPort=8282
      
    5. Navigate back to the Observability Pipelines installation page and click Deploy.

    See Update Existing Pipelines if you want to make changes to your pipeline’s configuration.

    1. Download the Helm chart values file for Azure AKS.
    2. Click Select API key to choose the Datadog API key you want to use.
    3. Add the Datadog chart repository to Helm:
      helm repo add datadog https://helm.datadoghq.com
      
      If you already have the Datadog chart repository, run the following command to make sure it is up to date:
      helm repo update
      
    4. Run the command provided in the UI to install the Worker. The command is automatically populated with the environment variables you entered earlier.
      helm upgrade --install opw \
      -f azure_aks.yaml \
      --set datadog.apiKey=<DATADOG_API_KEY> \
      --set datadog.pipelineId=<PIPELINE_ID> \
      --set <SOURCE_ENV_VARIABLES> \
      --set <DESTINATION_ENV_VARIABLES> \
      --set service.ports[0].protocol=TCP,service.ports[0].port=<SERVICE_PORT>,service.ports[0].targetPort=<TARGET_PORT> \
      datadog/observability-pipelines-worker
      
      Note: By default, the Kubernetes Service maps incoming port <SERVICE_PORT> to the port the Worker is listening on (<TARGET_PORT>). If you want to map the Worker’s pod port to a different incoming port of the Kubernetes Service, use the following service.ports[0].port and service.ports[0].targetPort values in the command:
      --set service.ports[0].protocol=TCP,service.ports[0].port=8088,service.ports[0].targetPort=8282
      
    5. Navigate back to the Observability Pipelines installation page and click Deploy.

    See Update Existing Pipelines if you want to make changes to your pipeline’s configuration.

    1. Download the Helm chart values file for Google GKE.
    2. Click Select API key to choose the Datadog API key you want to use.
    3. Add the Datadog chart repository to Helm:
      helm repo add datadog https://helm.datadoghq.com
      
      If you already have the Datadog chart repository, run the following command to make sure it is up to date:
      helm repo update
      
    4. Run the command provided in the UI to install the Worker. The command is automatically populated with the environment variables you entered earlier.
      helm upgrade --install opw \
      -f google_gke.yaml \
      --set datadog.apiKey=<DATADOG_API_KEY> \
      --set datadog.pipelineId=<PIPELINE_ID> \
      --set <SOURCE_ENV_VARIABLES> \
      --set <DESTINATION_ENV_VARIABLES> \
      --set service.ports[0].protocol=TCP,service.ports[0].port=<SERVICE_PORT>,service.ports[0].targetPort=<TARGET_PORT> \
      datadog/observability-pipelines-worker
      
      Note: By default, the Kubernetes Service maps incoming port <SERVICE_PORT> to the port the Worker is listening on (<TARGET_PORT>). If you want to map the Worker’s pod port to a different incoming port of the Kubernetes Service, use the following service.ports[0].port and service.ports[0].targetPort values in the command:
      --set service.ports[0].protocol=TCP,service.ports[0].port=8088,service.ports[0].targetPort=8282
      
    5. Navigate back to the Observability Pipelines installation page and click Deploy.

    See Update Existing Pipelines if you want to make changes to your pipeline’s configuration.

    1. Click Select API key to choose the Datadog API key you want to use.

    2. Run the one-step command provided in the UI to install the Worker.

      Note: The environment variables used by the Worker in /etc/default/observability-pipelines-worker are not updated on subsequent runs of the install script. If changes are needed, update the file manually and restart the Worker.

    If you prefer not to use the one-line installation script, follow these step-by-step instructions:

    1. Set up APT transport for downloading using HTTPS:
      sudo apt-get update
      sudo apt-get install apt-transport-https curl gnupg
      
    2. Run the following commands to set up the Datadog deb repo on your system and create a Datadog archive keyring:
      sudo sh -c "echo 'deb [signed-by=/usr/share/keyrings/datadog-archive-keyring.gpg] https://apt.datadoghq.com/ stable observability-pipelines-worker-2' > /etc/apt/sources.list.d/datadog-observability-pipelines-worker.list"
      sudo touch /usr/share/keyrings/datadog-archive-keyring.gpg
      sudo chmod a+r /usr/share/keyrings/datadog-archive-keyring.gpg
      curl https://keys.datadoghq.com/DATADOG_APT_KEY_CURRENT.public | sudo gpg --no-default-keyring --keyring /usr/share/keyrings/datadog-archive-keyring.gpg --import --batch
      curl https://keys.datadoghq.com/DATADOG_APT_KEY_06462314.public | sudo gpg --no-default-keyring --keyring /usr/share/keyrings/datadog-archive-keyring.gpg --import --batch
      curl https://keys.datadoghq.com/DATADOG_APT_KEY_F14F620E.public | sudo gpg --no-default-keyring --keyring /usr/share/keyrings/datadog-archive-keyring.gpg --import --batch
      curl https://keys.datadoghq.com/DATADOG_APT_KEY_C0962C7D.public | sudo gpg --no-default-keyring --keyring /usr/share/keyrings/datadog-archive-keyring.gpg --import --batch
      
    3. Run the following commands to update your local apt repo and install the Worker:
      sudo apt-get update
      sudo apt-get install observability-pipelines-worker datadog-signing-keys
      
    4. Add your keys, site (for example, datadoghq.com for US1), source, and destination environment variables to the Worker’s environment file:
      sudo cat <<EOF > /etc/default/observability-pipelines-worker
      DD_API_KEY=<DATADOG_API_KEY>
      DD_OP_PIPELINE_ID=<PIPELINE_ID>
      DD_SITE=<DATADOG_SITE>
      <SOURCE_ENV_VARIABLES>
      <DESTINATION_ENV_VARIABLES>
      EOF
      
    5. Start the worker:
      sudo systemctl restart observability-pipelines-worker
      

    See Update Existing Pipelines if you want to make changes to your pipeline’s configuration.

    1. Click Select API key to choose the Datadog API key you want to use.

    2. Run the one-step command provided in the UI to install the Worker.

      Note: The environment variables used by the Worker in /etc/default/observability-pipelines-worker are not updated on subsequent runs of the install script. If changes are needed, update the file manually and restart the Worker.

    If you prefer not to use the one-line installation script, follow these step-by-step instructions:

    1. Set up the Datadog rpm repo on your system with the below command. Note: If you are running RHEL 8.1 or CentOS 8.1, use repo_gpgcheck=0 instead of repo_gpgcheck=1 in the configuration below.
      cat <<EOF > /etc/yum.repos.d/datadog-observability-pipelines-worker.repo
      [observability-pipelines-worker]
      name = Observability Pipelines Worker
      baseurl = https://yum.datadoghq.com/stable/observability-pipelines-worker-2/\$basearch/
      enabled=1
      gpgcheck=1
      repo_gpgcheck=1
      gpgkey=https://keys.datadoghq.com/DATADOG_RPM_KEY_CURRENT.public
          https://keys.datadoghq.com/DATADOG_RPM_KEY_B01082D3.public
      EOF
      
    2. Update your packages and install the Worker:
      sudo yum makecache
      sudo yum install observability-pipelines-worker
      
    3. Add your keys, site (for example, datadoghq.com for US1), source, and destination environment variables to the Worker’s environment file:
      sudo cat <<-EOF > /etc/default/observability-pipelines-worker
      DD_API_KEY=<API_KEY>
      DD_OP_PIPELINE_ID=<PIPELINE_ID>
      DD_SITE=<SITE>
      <SOURCE_ENV_VARIABLES>
      <DESTINATION_ENV_VARIABLES>
      EOF
      
    4. Start the worker:
      sudo systemctl restart observability-pipelines-worker
      
    5. Navigate back to the Observability Pipelines installation page and click Deploy.

    See Update Existing Pipelines if you want to make changes to your pipeline’s configuration.

    1. Select one of the options in the dropdown to provide the expected log volume for the pipeline:

      OptionDescription
      UnsureUse this option if you are not able to project the log volume or you want to test the Worker. This option provisions the EC2 Auto Scaling group with a maximum of 2 general purpose t4g.large instances.
      1-5 TB/dayThis option provisions the EC2 Auto Scaling group with a maximum of 2 compute optimized instances c6g.large.
      5-10 TB/dayThis option provisions the EC2 Auto Scaling group with a minimum of 2 and a maximum of 5 compute optimized c6g.large instances.
      >10 TB/dayDatadog recommends this option for large-scale production deployments. It provisions the EC2 Auto Scaling group with a minimum of 2 and a maximum of 10 compute optimized c6g.xlarge instances.

      Note: All other parameters are set to reasonable defaults for a Worker deployment, but you can adjust them for your use case as needed in the AWS Console before creating the stack.

    2. Select the AWS region you want to use to install the Worker.

    3. Click Select API key to choose the Datadog API key you want to use.

    4. Click Launch CloudFormation Template to navigate to the AWS Console to review the stack configuration and then launch it. Make sure the CloudFormation parameters are as expected.

    5. Select the VPC and subnet you want to use to install the Worker.

    6. Review and check the necessary permissions checkboxes for IAM. Click Submit to create the stack. CloudFormation handles the installation at this point; the Worker instances are launched, the necessary software is downloaded, and the Worker starts automatically.

    7. Navigate back to the Observability Pipelines installation page and click Deploy.

    See Update Existing Pipelines if you want to make changes to your pipeline’s configuration.