

Overview

Embucket can connect to existing Apache Iceberg tables created by other tools like Spark, Trino, Dremio, or Flink. This enables seamless integration with your existing data lake infrastructure.
External Iceberg tables must be on the same storage volume (bucket) defined in your volume configuration. Embucket uses the volume credentials to access table metadata and data files.

When to Use External Iceberg

Use external Iceberg table configuration when:
  • You have existing Iceberg tables from Spark, Trino, or other engines
  • You need to share tables across multiple query engines
  • You want explicit control over which tables are accessible
  • Your tables are not in an S3 Tables catalog
  • You need to query tables created by other systems

Configuration Structure

External Iceberg configuration requires defining the storage volume and explicitly registering each table:
volumes:
  - ident: lakehouse
    type: s3
    region: us-east-2
    bucket: YOUR_BUCKET_NAME
    credentials:
      credential_type: access_key
      aws-access-key-id: YOUR_ACCESS_KEY
      aws-secret-access-key: YOUR_SECRET_KEY

databases:
  - ident: demo
    volume: lakehouse

schemas:
  - database: demo
    schema: tpch_10

tables:
  - database: demo
    schema: tpch_10
    table: customer
    metadata_location: s3://YOUR_BUCKET_NAME/tpch_10/customer/metadata/00001-eea1cccb-38a4-4fe2-8c95-c01dae9d0c60.metadata.json
  - database: demo
    schema: tpch_10
    table: lineitem
    metadata_location: s3://YOUR_BUCKET_NAME/tpch_10/lineitem/metadata/00001-d777220e-d508-4033-a229-8c4c8d8fe514.metadata.json

Volume Configuration

S3 Volume Setup

Define an S3 volume that points to the bucket containing your Iceberg tables:
ident (string, required)
Unique identifier for the storage volume. Referenced by databases and tables.

type (string, required)
Must be s3 for standard S3 storage.

region (string, required)
AWS region where the S3 bucket is located (e.g., us-east-2, eu-west-1).

bucket (string, required)
Name of the S3 bucket containing your Iceberg tables.

credentials (object)
AWS credentials for accessing the S3 bucket. See Credentials below.

endpoint (string)
Custom S3 endpoint URL for S3-compatible storage (MinIO, Ceph, etc.). Must start with http:// or https://.
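When pointing Embucket at S3-compatible storage, set the endpoint field on the volume. A hypothetical MinIO volume might look like this (the endpoint URL, bucket name, and credentials are placeholders):

```yaml
# Hypothetical MinIO volume; endpoint and credentials are placeholders
volumes:
  - ident: minio_lake
    type: s3
    region: us-east-1
    bucket: iceberg-data
    endpoint: http://minio.internal:9000
    credentials:
      credential_type: access_key
      aws-access-key-id: MINIO_ACCESS_KEY
      aws-secret-access-key: MINIO_SECRET_KEY
```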

Alternative Storage Types

Embucket also supports other volume types:
# Local filesystem (development)
volumes:
  - ident: local
    type: file
    path: /data/lakehouse

# In-memory (testing)
volumes:
  - ident: temp
    type: memory

Table Registration

Each external Iceberg table must be explicitly registered with its metadata location:

Required Fields

database (string, required)
The database this table belongs to. Must match a database defined in the databases section.

schema (string, required)
The schema this table belongs to. Must match a schema defined in the schemas section.

table (string, required)
The name of the table as it will appear in queries.

metadata_location (string, required)
Full S3 URI to the Iceberg metadata JSON file. This file contains the table schema, partition spec, and snapshot information.

Finding Metadata Locations

Iceberg metadata files are typically organized like:
s3://bucket/
└── database_name/
    └── table_name/
        ├── data/
        │   ├── partition1/
        │   └── partition2/
        └── metadata/
            ├── 00000-abc123.metadata.json
            ├── 00001-def456.metadata.json  ← Latest version
            └── snap-*.avro
To find the current metadata file (filtering out manifest and snapshot files so only metadata JSON files are considered):
aws s3 ls s3://YOUR_BUCKET/path/to/table/metadata/ \
  --recursive | grep '\.metadata\.json$' | sort | tail -1
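The same selection can be scripted. A minimal Python sketch, assuming you already have a list of object keys (the key names below are illustrative): Iceberg prefixes metadata files with a zero-padded version number (00000-, 00001-, ...), so a lexical sort of the file names puts the newest version last.

```python
# Pick the latest Iceberg metadata file from a list of S3 object keys.
# The zero-padded version prefix means lexical order == version order.
def latest_metadata(keys):
    metadata = [k for k in keys if k.endswith(".metadata.json")]
    return max(metadata) if metadata else None

# Illustrative listing of a table's metadata/ prefix
keys = [
    "tpch_10/customer/metadata/00000-abc123.metadata.json",
    "tpch_10/customer/metadata/00001-def456.metadata.json",
    "tpch_10/customer/metadata/snap-123456789.avro",
]
print(latest_metadata(keys))
# → tpch_10/customer/metadata/00001-def456.metadata.json
```

Note that snapshot and manifest files (snap-*.avro) are excluded before taking the maximum, mirroring the grep filter in the shell command above.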

Credentials

Access Key Authentication

credentials:
  credential_type: access_key
  aws-access-key-id: AKIAIOSFODNN7EXAMPLE
  aws-secret-access-key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

Required IAM Permissions

Your AWS credentials need read access to the S3 bucket:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::YOUR_BUCKET_NAME",
        "arn:aws:s3:::YOUR_BUCKET_NAME/*"
      ]
    }
  ]
}
For write operations (INSERT, UPDATE, DELETE), add:
"Action": [
  "s3:GetObject",
  "s3:PutObject",
  "s3:DeleteObject",
  "s3:ListBucket"
]

Complete Example

Here’s a full configuration for a data lake with multiple tables:
metastore.yaml
volumes:
  - ident: data_lake
    type: s3
    region: us-west-2
    bucket: my-iceberg-tables
    credentials:
      credential_type: access_key
      aws-access-key-id: ${AWS_ACCESS_KEY_ID}
      aws-secret-access-key: ${AWS_SECRET_ACCESS_KEY}

databases:
  - ident: analytics
    volume: data_lake

schemas:
  - database: analytics
    schema: sales
  - database: analytics
    schema: marketing

tables:
  # Sales schema tables
  - database: analytics
    schema: sales
    table: orders
    metadata_location: s3://my-iceberg-tables/sales/orders/metadata/00005-xyz789.metadata.json
  
  - database: analytics
    schema: sales
    table: customers
    metadata_location: s3://my-iceberg-tables/sales/customers/metadata/00003-abc123.metadata.json
  
  # Marketing schema tables
  - database: analytics
    schema: marketing
    table: campaigns
    metadata_location: s3://my-iceberg-tables/marketing/campaigns/metadata/00001-def456.metadata.json

Important Constraints

Same Bucket Requirement: All external Iceberg tables must reside in the same S3 bucket specified in the volume configuration. Tables in different buckets require separate volume definitions.

Multiple Volumes for Multiple Buckets

If you have tables in different buckets:
volumes:
  - ident: lake_us_east
    type: s3
    region: us-east-1
    bucket: bucket-us-east
    credentials:
      credential_type: access_key
      aws-access-key-id: KEY1
      aws-secret-access-key: SECRET1
  
  - ident: lake_eu_west
    type: s3
    region: eu-west-1
    bucket: bucket-eu-west
    credentials:
      credential_type: access_key
      aws-access-key-id: KEY2
      aws-secret-access-key: SECRET2

databases:
  - ident: us_data
    volume: lake_us_east
  - ident: eu_data
    volume: lake_eu_west

Querying External Tables

Once registered, query external Iceberg tables using standard SQL:
-- Query registered table
SELECT * FROM demo.tpch_10.customer LIMIT 10;

-- Join multiple tables (assumes an orders table is also registered,
-- since lineitem joins to customer through orders in TPC-H)
SELECT
  c.c_name,
  c.c_mktsegment,
  COUNT(l.l_orderkey) AS order_count
FROM demo.tpch_10.customer c
JOIN demo.tpch_10.orders o ON c.c_custkey = o.o_custkey
JOIN demo.tpch_10.lineitem l ON o.o_orderkey = l.l_orderkey
GROUP BY c.c_name, c.c_mktsegment;

-- Use Iceberg time travel
SELECT * FROM demo.tpch_10.customer
FOR SYSTEM_TIME AS OF '2024-01-01 00:00:00';

Metadata Updates

Embucket reads the metadata file specified in the configuration. If the table is updated by another tool (Spark, Trino), you’ll need to update the metadata_location in your configuration and restart Embucket to see changes.
For dynamic metadata discovery, consider using S3 Tables instead, which automatically tracks metadata changes.

Troubleshooting

Table Not Found

If queries fail with “table not found”:
  1. Verify the database, schema, and table names match your configuration exactly
  2. Check that schemas are defined before tables that use them
  3. Confirm the database references a valid volume

Invalid Metadata Location

If Embucket reports invalid metadata:
  1. Verify the S3 URI is correct and accessible
  2. Check that the metadata file exists at the specified location
  3. Ensure credentials have s3:GetObject permission
  4. Confirm the file is valid Iceberg metadata JSON
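Steps 2 and 4 can be partly automated before restarting Embucket. A minimal Python sketch that sanity-checks a downloaded metadata file; the required-key set is an illustrative subset of the Iceberg table spec, not a full validator:

```python
import json

# A few fields every Iceberg table-metadata file should carry.
# This is an illustrative subset of the spec, not exhaustive.
REQUIRED_KEYS = {"format-version", "table-uuid", "location"}

def check_metadata(text):
    """Return a list of problems found; an empty list means the basics look OK."""
    problems = []
    try:
        doc = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    for key in REQUIRED_KEYS - doc.keys():
        problems.append(f"missing key: {key}")
    if doc.get("format-version") not in (1, 2):
        problems.append("unsupported format-version")
    return problems

good = '{"format-version": 2, "table-uuid": "abc", "location": "s3://b/t"}'
print(check_metadata(good))  # → []
```

Running this against the file fetched from the metadata_location distinguishes a truncated or corrupted download (invalid JSON) from a file that parses but is not Iceberg table metadata (missing keys).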

Metadata Parse Errors

If metadata parsing fails:
  1. Verify the metadata file is from a compatible Iceberg version
  2. Check that the JSON structure is valid
  3. Ensure the file isn’t corrupted or truncated

Next Steps

Write SQL Queries

Learn Snowflake SQL syntax

Metastore Configuration

Complete YAML schema reference