Iceberg Tables (AWS Glue)


Overview

If you’re using the Iceberg framework in AWS Glue, you can see metadata from your Iceberg tables in Datadog through the AWS Glue integration. Use this data to monitor table schemas, data freshness, row counts, and table sizes.

Prerequisites

Before you begin, make sure you have:

An AWS account with Glue Iceberg tables you want to monitor.
The Datadog AWS integration configured for the account.
IAM permissions to modify the Datadog role’s policies.
(Optional) AWS Lake Formation access if you use it to manage table permissions.

Configure the AWS account

Navigate to Datadog Data Observability > Settings.
Click Configure next to AWS Glue.
Select an existing AWS account that is already connected to Datadog, or add a new one. For help adding a new account, see the AWS integration documentation.

Add required IAM permissions

The Data Observability crawler requires additional permissions to monitor Glue Iceberg tables. Attach the following policy to the Datadog IAM role configured for your AWS integration:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "glue:GetCatalog",
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:GetJobRun",
        "glue:GetJobRuns",
        "glue:GetJob",
        "glue:GetJobs",
        "glue:GetTable",
        "glue:GetTables",
        "glue:ListJobs",
        "s3:ListBucket",
        "kms:Decrypt",
        "lakeformation:GetDataAccess"
      ],
      "Resource": ["*"]
    },
    {
      "Sid": "AllowIcebergMetadataOnly",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:GetObjectVersion"
      ],
      "Resource": [
        "arn:aws:s3:::*/metadata/*"
      ]
    }
  ]
}
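
As a rough illustration of what the `AllowIcebergMetadataOnly` statement covers, the following sketch checks locally whether an S3 object would fall under the `arn:aws:s3:::*/metadata/*` resource pattern. It is an approximation of IAM wildcard matching, not the IAM evaluation engine, and the bucket name and object keys in the usage note are hypothetical.

```python
import re

# Resource pattern copied from the AllowIcebergMetadataOnly statement above.
METADATA_RESOURCE = "arn:aws:s3:::*/metadata/*"


def iam_pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate an IAM resource pattern into a compiled regex.

    In IAM resource ARNs, '*' matches any run of characters (including '/')
    and '?' matches exactly one character. Everything else is literal.
    """
    escaped = re.escape(pattern)
    escaped = escaped.replace(r"\*", ".*").replace(r"\?", ".")
    return re.compile(escaped, re.DOTALL)


def is_readable(bucket: str, key: str, pattern: str = METADATA_RESOURCE) -> bool:
    """Return True if the object ARN for bucket/key matches the pattern."""
    object_arn = f"arn:aws:s3:::{bucket}/{key}"
    return iam_pattern_to_regex(pattern).fullmatch(object_arn) is not None
```

For example, with a hypothetical bucket `my-lake`, `is_readable("my-lake", "warehouse/db/events/metadata/00001-abc.metadata.json")` is true, while a data file such as `warehouse/db/events/data/part-0000.parquet` does not match, which is the intent of limiting `s3:GetObject` to metadata paths.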

(Optional) Restrict access to specific databases and tables

The policy above grants access to all Glue resources. To monitor only specific databases or tables, scope the Glue catalog, database, and table actions to explicit ARNs instead of "Resource": ["*"], as shown in the example policies below.

AWS Glue IAM permissions are hierarchical. To access a table, the policy must include the catalog, the database, and the table. Omitting any level results in an access denied error.

Resource	ARN format	Example
Catalog	`arn:aws:glue:<REGION>:<ACCOUNT_ID>:catalog`	`arn:aws:glue:us-east-1:123456789012:catalog`
Database	`arn:aws:glue:<REGION>:<ACCOUNT_ID>:database/<DB_NAME>`	`arn:aws:glue:us-east-1:123456789012:database/analytics`
Table	`arn:aws:glue:<REGION>:<ACCOUNT_ID>:table/<DB_NAME>/<TABLE_NAME>`	`arn:aws:glue:us-east-1:123456789012:table/analytics/events`
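
Because all three levels must appear in the policy, it can help to generate the ARNs together. The helper below is a minimal sketch that follows the ARN formats in the table above; the function name is hypothetical and not part of any AWS or Datadog library.

```python
from typing import List


def glue_arns_for_table(region: str, account_id: str, database: str, table: str) -> List[str]:
    """Return the catalog, database, and table ARNs needed to read one table.

    The table argument may also be a wildcard such as "*" or "events_*".
    """
    base = f"arn:aws:glue:{region}:{account_id}"
    return [
        f"{base}:catalog",                     # catalog level
        f"{base}:database/{database}",         # database level
        f"{base}:table/{database}/{table}",    # table level
    ]
```

Calling `glue_arns_for_table("us-east-1", "123456789012", "analytics", "events")` yields the three example ARNs from the table, ready to paste into a policy's `Resource` list.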

Example policies

To monitor all tables in specific databases, include the catalog, each database, and a wildcard for tables in those databases:

{
  "Effect": "Allow",
  "Action": [
    "glue:GetCatalog",
    "glue:GetDatabase",
    "glue:GetDatabases",
    "glue:GetTable",
    "glue:GetTables"
  ],
  "Resource": [
    "arn:aws:glue:us-east-1:123456789012:catalog",
    "arn:aws:glue:us-east-1:123456789012:database/production_db",
    "arn:aws:glue:us-east-1:123456789012:database/analytics_db",
    "arn:aws:glue:us-east-1:123456789012:table/production_db/*",
    "arn:aws:glue:us-east-1:123456789012:table/analytics_db/*"
  ]
}

To monitor only specific tables, list each table explicitly. You can also use wildcards to match table name patterns:

{
  "Effect": "Allow",
  "Action": [
    "glue:GetCatalog",
    "glue:GetDatabase",
    "glue:GetDatabases",
    "glue:GetTable",
    "glue:GetTables"
  ],
  "Resource": [
    "arn:aws:glue:us-east-1:123456789012:catalog",
    "arn:aws:glue:us-east-1:123456789012:database/production_db",
    "arn:aws:glue:us-east-1:123456789012:table/production_db/orders",
    "arn:aws:glue:us-east-1:123456789012:table/production_db/customers",
    "arn:aws:glue:us-east-1:123456789012:table/production_db/events_*"
  ]
}

The wildcard events_* matches tables like events_clicks, events_purchases, and any other table starting with events_.
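
To preview which tables a set of table patterns would cover before editing the policy, a local check along these lines may be useful. It uses Python's `fnmatchcase` as a stand-in for IAM wildcard matching, which is an assumption that holds for table names made of letters, digits, and underscores; the table names in the usage note are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List, Sequence

# Table-name portions of the Resource ARNs in the example policy above.
TABLE_PATTERNS = ("orders", "customers", "events_*")


def monitored_tables(table_names: Iterable[str],
                     patterns: Sequence[str] = TABLE_PATTERNS) -> List[str]:
    """Return the table names matched by at least one pattern, in input order."""
    return [
        name for name in table_names
        if any(fnmatchcase(name, pattern) for pattern in patterns)
    ]
```

Given `["orders", "events_clicks", "events_purchases", "event_log", "customers_archive"]`, only the first three names are returned: `event_log` lacks the `events_` prefix, and `customers_archive` is not an exact match for `customers`.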

For more information, see the AWS Glue identity-based policy examples.

(Optional) Configure Lake Formation access

If you use AWS Lake Formation to manage access to your Glue Catalog tables, grant the Datadog role access to the databases and tables you want to monitor.

With the AWS CLI, run the following commands, replacing the placeholder values with your actual account ID, role name, database name, and S3 bucket:

PRINCIPAL="arn:aws:iam::<YOUR_AWS_ACCOUNT_ID>:role/<YOUR_DATADOG_ROLE_NAME>"

aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=$PRINCIPAL \
  --resource '{"Database":{"Name":"<YOUR_DATABASE_NAME>"}}' \
  --permissions DESCRIBE

aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=$PRINCIPAL \
  --resource '{"Table":{"DatabaseName":"<YOUR_DATABASE_NAME>","TableWildcard":{}}}' \
  --permissions DESCRIBE SELECT

aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=$PRINCIPAL \
  --resource '{"DataLocation":{"ResourceArn":"arn:aws:s3:::<YOUR_S3_BUCKET_NAME>"}}' \
  --permissions DATA_LOCATION_ACCESS
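
If you prefer to script these grants, the following sketch builds equivalent requests for the boto3 `lakeformation` client's `grant_permissions` call. The function names and the sample values in the usage note are hypothetical. The sketch expresses the all-tables grant as a `Table` resource with an empty `TableWildcard`, and it grants only `DESCRIBE` at the database level on the assumption that `SELECT` applies to tables; adjust it if your setup differs.

```python
from typing import Any, Dict, List


def build_grant_requests(account_id: str, role_name: str,
                         database: str, bucket: str) -> List[Dict[str, Any]]:
    """Build keyword arguments for three Lake Formation grant_permissions calls."""
    principal = {
        "DataLakePrincipalIdentifier": f"arn:aws:iam::{account_id}:role/{role_name}"
    }
    return [
        # Database-level metadata access.
        {"Principal": principal,
         "Resource": {"Database": {"Name": database}},
         "Permissions": ["DESCRIBE"]},
        # All tables in the database.
        {"Principal": principal,
         "Resource": {"Table": {"DatabaseName": database, "TableWildcard": {}}},
         "Permissions": ["DESCRIBE", "SELECT"]},
        # The S3 location that holds the table data and metadata.
        {"Principal": principal,
         "Resource": {"DataLocation": {"ResourceArn": f"arn:aws:s3:::{bucket}"}},
         "Permissions": ["DATA_LOCATION_ACCESS"]},
    ]


def apply_grants(requests: List[Dict[str, Any]], region: str) -> None:
    """Send each request to Lake Formation. Requires boto3 and AWS credentials."""
    import boto3  # third-party; imported here so the builder works without it

    client = boto3.client("lakeformation", region_name=region)
    for request in requests:
        client.grant_permissions(**request)
```

A typical use would be `apply_grants(build_grant_requests("123456789012", "DatadogIntegrationRole", "analytics", "my-lake"), "us-east-1")`. Inspecting the output of `build_grant_requests` first lets you review the grants before anything is sent to AWS.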

Alternatively, use the AWS Console:

In the AWS Console, navigate to Lake Formation > Data lake permissions.
Click Grant.
Under Principals, select IAM users and roles and choose your Datadog role.
Under LF-Tags or catalog resources, select the database and tables you want to monitor.
Under Permissions, select DESCRIBE and SELECT.
Click Grant.


Configure the crawler

Select the AWS regions where your Glue Iceberg tables are located.
Enable the Quality Monitoring for Apache Iceberg toggle.
(Optional) Enable the Job Monitoring toggle if you also want to monitor Glue job health and performance.
Choose a sync frequency.
(Optional) Enter a catalog name if you use nested Glue catalog features. Leave this field empty for the default catalog.
Click Save.

Next steps

After you complete the setup, Datadog begins syncing your Glue Iceberg table metadata in the background. Initial syncs can take up to an hour depending on the number of tables in your catalog.

After the sync completes, your tables appear in the Data Catalog.