When you plan a new software installation, it's crucial to understand its capabilities, objectives, timelines, teams, and design patterns. In the plan phase, learn Datadog basics, define your most important objectives, understand best practices, and identify how to optimize your Datadog installation.
Setting a clear goal is critical for installing Datadog. In practice, however, it is impossible to predict everything you might need at the outset. Product engineers iterate on their rollouts and systems operators control their changes, all to limit risk. A large-scale Datadog installation benefits from standard project management practices. As part of that process, there are certain Datadog elements that you should include. Send your engineering organization a survey outline to size and whiteboard your needs.
Recommendations:
Start collecting and consolidating a survey of your organization early. Create a comprehensive view of your ecosystems, application languages, data storage, networking, and infrastructure.
A sample survey form might look like this:
Application name:
Language:
Frameworks:
Model layer:
View layer:
Controller layer:
Infra type:
Operating systems:
Complete the scoping exercise to understand the types of technologies you’re working with, and start mapping those to core products in Datadog.
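One way to consolidate survey responses is to store each one as a structured record and count the technologies that appear. The following is a minimal sketch, not a Datadog feature: the `SurveyEntry` class, the `summarize` helper, and the sample applications are all hypothetical, with fields that mirror the sample form above.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class SurveyEntry:
    """One response to the scoping survey; fields mirror the sample form."""
    application_name: str
    language: str
    frameworks: list[str] = field(default_factory=list)
    model_layer: str = ""
    view_layer: str = ""
    controller_layer: str = ""
    infra_type: str = ""
    operating_systems: list[str] = field(default_factory=list)


def summarize(entries: list[SurveyEntry]) -> dict[str, Counter]:
    """Count how often each language, infra type, and OS appears."""
    summary = {
        "language": Counter(),
        "infra_type": Counter(),
        "operating_system": Counter(),
    }
    for entry in entries:
        summary["language"][entry.language] += 1
        summary["infra_type"][entry.infra_type] += 1
        summary["operating_system"].update(entry.operating_systems)
    return summary


# Hypothetical responses, for illustration only.
responses = [
    SurveyEntry("checkout", "Java", ["Spring"], infra_type="Kubernetes",
                operating_systems=["Linux"]),
    SurveyEntry("billing", "Python", ["Django"], infra_type="VM",
                operating_systems=["Linux"]),
    SurveyEntry("reports", "Java", infra_type="VM",
                operating_systems=["Windows"]),
]
survey_summary = summarize(responses)
```

The resulting counts give a rough ranking of which languages and platforms to map to Datadog products first.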
Datadog is a tool for correlating machine data with the running applications and their physical descriptors. It can cross-reference an individual piece of data against others, regardless of type. Hostname, cloud regions, operating system version, and IP are just some of the automatically applied resource attributes. Additionally, Datadog allows you to generate custom tags such as `cost-code`, `AppName`, `environment`, and `version`.
Datadog’s strength lies in its capability to maintain and manage a unified vocabulary and includes built-in data features. Unified Service Tagging uses reserved tags that enable telemetry correlation across all features of the Datadog platform.
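As an illustration, the reserved tags used by Unified Service Tagging are `env`, `service`, and `version`, which are commonly supplied through the `DD_ENV`, `DD_SERVICE`, and `DD_VERSION` environment variables. The sketch below builds that set of variables for one service. The helper name `unified_service_tags` and the sample values are hypothetical, and how the variables are injected depends on your platform.

```python
RESERVED_TAGS = ("env", "service", "version")


def unified_service_tags(env: str, service: str, version: str) -> dict[str, str]:
    """Return the DD_* environment variables for the three reserved tags."""
    values = {"env": env, "service": service, "version": version}
    for key in RESERVED_TAGS:
        if not values[key]:
            # An empty reserved tag would break correlation across telemetry.
            raise ValueError(f"reserved tag '{key}' must not be empty")
    return {f"DD_{key.upper()}": values[key] for key in RESERVED_TAGS}


# Hypothetical service, for illustration only.
ust_env = unified_service_tags(env="staging", service="checkout", version="1.4.2")
```

Setting the same three values on the Agent, the tracing library, and the log source is what allows telemetry from all of them to be correlated.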
Tags are `key:value` pairs or simple values. They add dimension to application performance data and infrastructure metrics. Before you begin monitoring with Datadog, take advantage of the tagging capabilities that your platforms offer, as Datadog automatically imports these tags through its integrations. The following table shows how `key:value` pairs work and whether the tags are added automatically or manually.
TAG | KEY | VALUE | METHOD |
---|---|---|---|
env:staging | env | staging | automatic |
component_type:database | component_type | database | manual |
region:us-west-1 | region | us-west-1 | automatic |
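The split between key and value shown in the table can be sketched in a few lines. This is an illustrative helper, not Datadog's own parser: `split_tag` is a hypothetical name, and it assumes the key is everything before the first colon.

```python
from typing import Optional


def split_tag(tag: str) -> tuple[Optional[str], str]:
    """Split 'key:value' into (key, value); a simple value has no key."""
    key, separator, value = tag.partition(":")
    if not separator:
        # No colon present, so the whole tag is a simple value.
        return None, tag
    return key, value


for tag in ("env:staging", "component_type:database", "region:us-west-1", "staging"):
    print(tag, "->", split_tag(tag))
```

Splitting on the first colon only keeps values that themselves contain colons intact.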
The Getting started with tagging guide is a great place to start with this topic. Here are some additional highlights:
Recommendations:
The following diagram depicts how each of the three reserved tags may look as you are building out your environment:
At the architectural design level, there are two main areas of access control within Datadog: organization structure, and role-based access control (RBAC).
Datadog role-based access control can connect to your existing SAML authentication service. SAML group-mappings can be built against the Datadog default roles and team objects. Datadog provides three default roles, which you can customize to match the complexity of your own AD/LDAP Roles. You can also set up service accounts for non-interactive purposes like API and App Key ownership, to separate user activities from system tasks. Granular permissions set the access and protections you need.
As an additional layer, Teams lets you set up flexible, informal, and ad-hoc groups that users can join or be added to. The Teams feature is available throughout Datadog products.
Larger Datadog customers often have more than one Datadog installation. This is typically used by managed service providers to ensure that their customers do not have access to one another’s data. In some cases, full isolation within a single company is necessary. To accommodate this topology, you can manage multiple organizational accounts. For example, you can view total usage at the parent level while the child organizations remain completely separate technologically. Child organizations should be managed from a single parent organization account.
Recommendations:
APM depends on the application of Unified Service Tagging. These tags are pivotal to the operational experience, and are also useful for enabling correlation across your telemetry data. This is how Datadog can help determine the owner for a random Java process it discovers.
The default APM setup is sufficient for most use cases. If you want to change sampling rates or customize other APM configurations, use the following guidelines.
Recommendations:
Log Management capabilities allow you and your teams to diagnose and fix your infrastructure issues. With Logging without Limits™ you can create tunable log collection patterns and extract information from your log data into custom metrics. You can also be alerted to critical errors in your logs, without needing to index (that is, store) them.
Datadog’s log index architecture is a distributed, timeseries, columnar store that is optimized for serving large scan and aggregation queries. See Introducing Husky and Husky Deep Dive for more information about Datadog’s logging architecture.
The logging platform can be configured with several layers of log storage. Each has its own use cases, outlined here:
Storage layer | Data captured | Retention | Retrieval to Datadog | Cost driver |
---|---|---|---|---|
Ignored | No | None | None | N/A |
Ingested | Logs-to-metrics | 15m in LiveTail | None | Volume |
Archive | Upon rehydrate | Infinite | Slow | Volume |
Forward logs | Logs-to-metrics | Determined by target | None | Volume |
Index | Yes | 3, 7, 15, or 30 days | Immediate | Volume & message count |
Flex Logs | Yes* | Storage tiers | Rapid | Retrieval volume |
* Flex Logs capability does not include monitors/alerting and Watchdog. However, you can still perform log-to-metrics on the ingestion stream before logs are indexed (in either standard or flex) and use those metrics in monitors. Correlation with other telemetry, such as traces, is supported.
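Conceptually, log-to-metrics means counting matching log events on the ingestion stream, so that the counts remain available for monitors even when the logs themselves are not indexed. The sketch below illustrates only that idea. It is not Datadog's implementation, and in Datadog this is configured in the product rather than written as code. The function `count_logs` and the sample events are hypothetical.

```python
from collections import Counter
from typing import Iterable


def count_logs(logs: Iterable[dict],
               group_by: tuple[str, ...] = ("service", "status")) -> Counter:
    """Aggregate log events into counts keyed by key:value tags in group_by order."""
    counts: Counter = Counter()
    for log in logs:
        tags = tuple(f"{key}:{log.get(key, 'unknown')}" for key in group_by)
        counts[tags] += 1
    return counts


# Hypothetical ingested log events, for illustration only.
ingested = [
    {"service": "checkout", "status": "error", "message": "payment declined"},
    {"service": "checkout", "status": "info", "message": "order placed"},
    {"service": "checkout", "status": "error", "message": "timeout"},
]
log_counts = count_logs(ingested)
```

Only the aggregated counts are kept, which is why a metric derived this way can be far cheaper to retain than the underlying log messages.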
With these base functions, you can ingest and monitor logs for some of the following use-cases:
Recommendations:
Real User Monitoring and Session Replay give detailed insights into the experiences of your service or application end-user. Install RUM on applications with high value sessions to leverage the data for meaningful changes. Session Replay provides a visual representation that is invaluable for troubleshooting issues observed by users. You can track the actual customer experience, which is most valuable in production environments.
Recommendations:
Synthetic Monitoring is a full synthetic testing suite, which includes testing for browser applications, mobile apps, and APIs. Synthetic test results can be linked back to the application behavior they generated, and from there, into the database, message queues, and downstream services.
Recommendations:
You can use Datadog’s 800+ integrations to bring together all of the metrics and logs from your infrastructure and gain insights into an entire observability system. The integrations, either SaaS-based or Agent-based, collect metrics to monitor within Datadog. Host-based integrations are configured with YAML files that live in the `conf.d` directory, and container-based integrations are configured with metadata such as annotations or labels.
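To make the host-based layout concrete, the sketch below computes where a check's configuration file would live and renders a minimal configuration for the `http_check` integration. Several details are assumptions: `/etc/datadog-agent/conf.d` is the usual Linux location and differs on other platforms, the helper names are hypothetical, and the instance options shown (`name`, `url`) should be verified against the integration's example configuration.

```python
import json
from pathlib import PurePosixPath


def check_config_path(check_name: str,
                      conf_dir: str = "/etc/datadog-agent/conf.d") -> PurePosixPath:
    """Return the conventional conf.d/<check>.d/conf.yaml path for a check."""
    return PurePosixPath(conf_dir) / f"{check_name}.d" / "conf.yaml"


def render_http_check(name: str, url: str) -> str:
    """Render a minimal YAML document with one http_check instance."""
    # json.dumps produces double-quoted strings, which are valid YAML scalars.
    return (
        "init_config:\n"
        "\n"
        "instances:\n"
        f"  - name: {json.dumps(name)}\n"
        f"    url: {json.dumps(url)}\n"
    )


# Hypothetical endpoint, for illustration only.
path = check_config_path("http_check")
rendered = render_http_check("Example health endpoint", "https://example.com/health")
print(path)
print(rendered)
```

Generating these files from a template is one way to keep integration settings consistent across hosts.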
There are different types of integrations in Datadog, and the order in which they are presented here is the order Datadog recommends for their installation.
Category | Description |
---|---|
Cloud Technologies (AWS, Google Cloud, Azure) | These integrations use provisioned credentials to scrape monitoring endpoints for metrics. Fine-tune these to ingest only desired telemetry. |
Incident Response (Slack, Jira, PagerDuty) | These integrations send notifications when events occur and are vital for establishing an efficient alerting system. |
Infrastructure (orchestration, operating system, network) | These integrations serve as the foundational components for monitoring your infrastructure, gathering both metrics and logs. |
Data Layers (data stores, message queues) | These integrations usually query internal DB metrics tables, so this usually requires a database administrator to provide access for the Agent. |
Development (automation, languages, source control) | These integrations push metrics to Datadog and require configuration on their end. Some may require DogStatsD to ship metrics. |
Security and Compliance (Okta, Open Policy Agent) | These integrations enable you to verify compliance with your standards. |
Recommendations:
You’ve achieved some important wins and adopted best practices with APM, RUM, Synthetic Monitoring and Log Management. Some additional resources that are important when planning your installation phase are outlined below.
Use Live Processes to view all of your running processes in one place. For example, see PID-level information of a running Apache process, to help you understand and troubleshoot transient issues. Additionally, you can query for processes running on a specific host, in a specific availability zone, or running a specific workload. Configure Live Processes on your hosts, containers, or AWS ECS Fargate instances to take advantage of this feature.
Web server operations depend on the network availability of ports, the validity of SSL certificates, and low latencies. Install the HTTP_Check to monitor local or remote HTTP endpoints, detect bad response codes (such as 404), and use Synthetic API tests to identify soon-to-expire SSL certificates.
Web servers are almost always inter-connected with other services through a network fabric that is vulnerable to drops and can result in re-transmits. Use Datadog’s network integration and enable Network Performance Monitoring to gain visibility into your network traffic between services, containers, availability zones, and other tags on your infrastructure.
Datadog infrastructure monitoring comes with additional products that you can use to maximize observability of your environments.
Service Catalog provides an overview of services, showing which were recently deployed, which haven’t been deployed for a while, which report the most errors, which have ongoing incidents, and much more.
Service Catalog also helps you evaluate the coverage of your observability setup. As you continue your roll out, you can check in on the Setup Guidance tab of each of your services, to ensure that they have the expected configurations:
You can add components that you aren’t planning on monitoring immediately, such as cron jobs or libraries, to create a comprehensive view of your system, and to mark team members who are responsible for these components ahead of the next phase of your Datadog rollout.
Use Resource Catalog to view key resource information such as metadata, ownership, configurations, relationships between assets, and active security risks. It is the central hub of all of your infrastructure resources. Resource Catalog offers visibility into infrastructure compliance, promotes good tagging practices, reduces application risks by identifying security vulnerabilities, provides engineering leadership with a high-level view of security practices, and allows resource export for record-keeping or auditing.
You can use Resource Catalog in a variety of contexts, including:
Use API Catalog for resource endpoint-specific categorization, performance, reliability, and ownership of all your API endpoints in one place.
Without any additional setup, Event Management can show third-party event statuses and events generated by the Agent and installed integrations. Datadog Event Management centralizes third-party events, such as alerts and change events. Datadog also automatically creates events from various products, including monitors and Error Tracking. You can also use Event Management to send monitor alerts and notifications based on event queries.
See errors where they happen with Datadog’s Error Tracking. Error Tracking can ingest errors from APM, Log Management, and Real User Monitoring to debug issues faster.
Centrally administer and manage all of your Datadog Agents with Fleet Automation. Fleet Automation can identify which Agents need upgrading, send a flare to support, and help in the task of disabling or rotating API keys.
Use Datadog’s Remote Configuration (enabled by default) to remotely configure and change the behavior of Datadog components (for example, Agents, tracing libraries, and Observability Pipelines Worker) deployed in your infrastructure. For more information, see supported products and capabilities.
Use Datadog Notebooks to share information with team members and to aid troubleshooting investigations or incidents.
Datadog collects many things in your environments, so it is important to limit the number of collection points and establish guard rails. In this section, you’ll learn the mechanisms that control telemetry collection and how you can codify your organization’s needs.
Datadog interacts with the monitoring APIs of hypervisor managers (Hyper-V, vSphere, PCF), container schedulers (Kubernetes, Rancher, Docker), and public cloud providers (AWS, Azure, GCP). The platform autodiscovers resources (pods, VMs, EC2s, ALBs, AzureSQL, GCP blobs) within those environments. It is important to limit the number of monitored resources, because they have billing implications.
Recommendations:
Enable resource collection for AWS and GCP to view an inventory of resources, as well as cost and security insights. Additionally, limit metric collection for your Azure resources and your containerized environments.
During the planning phase, you may find that not all instances of observability are equally important. Some are mission-critical, while others are not. For this reason, it is useful to devise patterns for coverage levels, and which environments you want to monitor with Datadog. For example, a production environment might be monitored at every level, but a development instance of the same application might only have the developer-focused tooling.
Recommendations:
To begin mapping out your installation patterns, combine the information you gathered from the scoping exercise with the Datadog 101 training, and develop a plan of action. Consider the following example, and modify it to suit your organization’s needs. The example outlines an installation pattern along the dimension of SDLC environment (dev, qa, stage, prod), and you can customize it to your standards and conditions. Begin setting expectations within your own Datadog user base about what a “standard Datadog installation” means.
Product | Dev | QA | Staging | Prod |
---|---|---|---|---|
APM | Deny/Allow | Allow | Allow | Allow |
Synthetics | Deny | Deny/Allow | Allow | Allow |
Log Management | Allow | Allow | Allow | Allow |
RUM | Deny | Deny/Allow | Allow | Allow |
DBM | Deny/Allow | Deny/Allow | Allow | Allow |
Live Processes | Deny | Deny/Allow | Allow | Allow |
Recommendations: Not every tool suits every job. Evaluate Datadog’s product use cases and match them with your needs. Consider the SDLC levels, application importance, and the purpose of each Datadog product.
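An installation pattern like the example table can also be kept as data, so that teams or automation can look up what is expected in each environment. The sketch below encodes the example table as-is. It reads “Deny/Allow” as “decide case by case”, which is an assumption, and the names `INSTALL_PATTERN`, `allowed_products`, and `case_by_case` are hypothetical.

```python
# Example installation pattern, copied from the table above.
INSTALL_PATTERN = {
    "APM":            {"dev": "deny/allow", "qa": "allow",      "staging": "allow", "prod": "allow"},
    "Synthetics":     {"dev": "deny",       "qa": "deny/allow", "staging": "allow", "prod": "allow"},
    "Log Management": {"dev": "allow",      "qa": "allow",      "staging": "allow", "prod": "allow"},
    "RUM":            {"dev": "deny",       "qa": "deny/allow", "staging": "allow", "prod": "allow"},
    "DBM":            {"dev": "deny/allow", "qa": "deny/allow", "staging": "allow", "prod": "allow"},
    "Live Processes": {"dev": "deny",       "qa": "deny/allow", "staging": "allow", "prod": "allow"},
}


def products_with(decision: str, environment: str) -> list[str]:
    """List products whose entry for the environment equals the decision."""
    return [product for product, envs in INSTALL_PATTERN.items()
            if envs[environment] == decision]


def allowed_products(environment: str) -> list[str]:
    """Products that are part of the standard installation for the environment."""
    return products_with("allow", environment)


def case_by_case(environment: str) -> list[str]:
    """Products marked Deny/Allow, assumed here to need a per-application decision."""
    return products_with("deny/allow", environment)


print("dev standard:", allowed_products("dev"))
print("dev case by case:", case_by_case("dev"))
```

Keeping the pattern in one place makes it easier to review and adjust as your standards change.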
It is important to develop and plan a realistic course for installing Datadog. In this section, you learned about the planning and best practices phase, setting your Datadog footprint up for success. You identified and assembled your knowledge base and team members, developed your installation models, planned optimizations, and compiled a list of best practices for core products. These foundations prepare you for the next phases of Datadog installation: build and run.
In the build phase, create a detailed rollout methodology: focus on the construction of Datadog itself, iterate on your environment, establish internal support mechanisms, and prepare for production.