The metastore configuration file is a YAML file that defines the volumes, databases, schemas, and tables to bootstrap when Embucket starts. Specify its path with the --metastore-config flag or the METASTORE_CONFIG environment variable.

Schema Structure

volumes:
  - ident: string
    type: s3 | s3tables | file | memory
    database: string (optional)
    should_refresh: boolean (optional)
    # type-specific fields below

databases:
  - ident: string
    volume: string
    should_refresh: boolean (optional)

schemas:
  - database: string
    schema: string

tables:
  - database: string
    schema: string
    table: string
    metadata_location: string

Volumes

Volumes define storage backends for Iceberg tables.
volumes[].ident
string
required
Unique identifier for the volume.
ident: my_s3_bucket
volumes[].type
enum
required
Volume type. Options: s3, s3tables, file, memory
type: s3
volumes[].database
string
Optional database name to auto-create for this volume.
database: analytics
volumes[].should_refresh
boolean
default: false
Whether to refresh the volume metadata on startup.
should_refresh: true

S3 Volume

region
string
AWS region for the S3 bucket.
region: us-east-1
bucket
string
S3 bucket name. Must contain only alphanumeric characters, hyphens, or underscores, and must not start or end with a hyphen or underscore.
bucket: my-data-bucket
endpoint
string
Custom S3 endpoint URL. Must start with http:// or https://.
endpoint: https://s3.amazonaws.com
credentials
object
required
AWS credentials object.
credentials:
  credential_type: access_key
  aws-access-key-id: AKIAIOSFODNN7EXAMPLE
  aws-secret-access-key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

S3 Credentials

Access Key Credentials:
credential_type
string
required
Must be access_key for access key credentials.
aws-access-key-id
string
required
AWS access key ID (20-character alphanumeric string).
aws-secret-access-key
string
required
AWS secret access key (40-character Base64-like string).
aws-session-token
string
Optional AWS session token for temporary credentials.
Token Credentials:
credential_type
string
required
Must be token for token-based credentials.

S3 Tables Volume

arn
string
required
Amazon S3 Tables bucket ARN. Format: arn:aws:s3tables:region:account-id:bucket/bucket-name
arn: arn:aws:s3tables:us-east-1:123456789012:bucket/my-table-bucket
endpoint
string
Custom endpoint URL for S3 Tables. Must start with http:// or https://.
endpoint: https://s3tables.us-east-1.amazonaws.com
credentials
object
required
AWS credentials (same format as S3 volume credentials).
credentials:
  credential_type: access_key
  aws-access-key-id: AKIAIOSFODNN7EXAMPLE
  aws-secret-access-key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
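
As a rough illustration of the ARN format above, the following Python sketch splits an S3 Tables ARN into its parts. The function name `parse_s3tables_arn`, the 12-digit account ID, and the allowed character sets are assumptions made for this example; they are not taken from Embucket's own validator.

```python
import re
from typing import NamedTuple

# Hypothetical pattern derived only from the documented format string
# arn:aws:s3tables:region:account-id:bucket/bucket-name.
# The 12-digit account ID and the character classes are assumptions.
ARN_RE = re.compile(
    r"^arn:aws:s3tables:"
    r"(?P<region>[a-z0-9-]+):"
    r"(?P<account_id>\d{12}):"
    r"bucket/(?P<bucket>[A-Za-z0-9_-]+)$"
)


class S3TablesArn(NamedTuple):
    region: str
    account_id: str
    bucket: str


def parse_s3tables_arn(arn: str) -> S3TablesArn:
    """Split an S3 Tables bucket ARN into region, account ID, and bucket name."""
    match = ARN_RE.match(arn)
    if match is None:
        raise ValueError(f"not a valid S3 Tables bucket ARN: {arn!r}")
    return S3TablesArn(match["region"], match["account_id"], match["bucket"])
```

Calling `parse_s3tables_arn` on the example ARN above would yield the region `us-east-1`, the account ID `123456789012`, and the bucket name `my-table-bucket`.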

File Volume

path
string
required
Local filesystem path for the volume.
path: /data/iceberg

Memory Volume

Memory volumes have no additional configuration beyond type: memory.

Databases

databases[].ident
string
required
Database identifier name.
ident: analytics
databases[].volume
string
required
Volume identifier to use for this database. Must reference a defined volume.
volume: my_s3_bucket
databases[].should_refresh
boolean
default: false
Whether to refresh the database metadata on startup.
should_refresh: true

Schemas

schemas[].database
string
required
Database name for this schema.
database: analytics
schemas[].schema
string
required
Schema name.
schema: production
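
The references between sections (a database names a volume, a schema names a database) can be checked before starting Embucket. The sketch below is one possible pre-flight check over an already-parsed configuration dictionary; `find_dangling_references` is a hypothetical helper, and treating a volume's `database` field as a defined database is an assumption based on the auto-create description above.

```python
from __future__ import annotations


def find_dangling_references(config: dict) -> list[str]:
    """Return human-readable problems for references to undefined idents."""
    problems = []
    volumes = config.get("volumes", [])
    volume_idents = {v["ident"] for v in volumes}

    # Databases listed explicitly, plus (assumption) databases that a
    # volume auto-creates through its optional `database` field.
    database_idents = {d["ident"] for d in config.get("databases", [])}
    database_idents |= {v["database"] for v in volumes if "database" in v}

    for db in config.get("databases", []):
        if db["volume"] not in volume_idents:
            problems.append(
                f"database {db['ident']!r} references unknown volume {db['volume']!r}"
            )
    for schema in config.get("schemas", []):
        if schema["database"] not in database_idents:
            problems.append(
                f"schema {schema['schema']!r} references unknown database "
                f"{schema['database']!r}"
            )
    return problems
```

An empty list means every database points at a defined volume and every schema points at a known database.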

Tables

Tables can be pre-registered from existing Iceberg metadata.
tables[].database
string
required
Database name for this table.
database: analytics
tables[].schema
string
required
Schema name for this table.
schema: production
tables[].table
string
required
Table name.
table: users
tables[].metadata_location
string
required
S3 or file URL to the Iceberg metadata JSON file.
metadata_location: s3://my-bucket/warehouse/users/metadata/v1.metadata.json
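
Because `metadata_location` is described as an S3 or file URL, a quick scheme check can catch typos before bootstrap. This is a minimal sketch using Python's standard `urllib.parse`; the helper name `split_metadata_location` and the exact set of accepted schemes are assumptions, and Embucket itself may accept other forms.

```python
from urllib.parse import urlparse

# Assumed from the field description: only s3:// and file:// URLs.
ALLOWED_SCHEMES = {"s3", "file"}


def split_metadata_location(location: str) -> tuple:
    """Return (scheme, bucket_or_host, path) for a metadata_location URL."""
    parsed = urlparse(location)
    if parsed.scheme not in ALLOWED_SCHEMES:
        raise ValueError(
            f"unsupported scheme {parsed.scheme!r} in metadata_location {location!r}"
        )
    # For s3:// URLs the network location is the bucket name;
    # for file:/// URLs it is empty and the path is the local file path.
    return parsed.scheme, parsed.netloc, parsed.path
```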

Complete Examples

Basic Memory Volume

volumes:
  - ident: embucket
    type: memory
    database: embucket

databases:
  - ident: embucket
    volume: embucket

schemas:
  - database: embucket
    schema: public

S3 Volume with Access Keys

volumes:
  - ident: production_data
    type: s3
    region: us-west-2
    bucket: my-data-lake
    database: analytics
    credentials:
      credential_type: access_key
      aws-access-key-id: AKIAIOSFODNN7EXAMPLE
      aws-secret-access-key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

databases:
  - ident: analytics
    volume: production_data

schemas:
  - database: analytics
    schema: public
  - database: analytics
    schema: staging

S3 Tables Volume

volumes:
  - ident: s3tables_production
    type: s3tables
    arn: arn:aws:s3tables:us-east-1:123456789012:bucket/prod-iceberg
    database: warehouse
    credentials:
      credential_type: access_key
      aws-access-key-id: AKIAIOSFODNN7EXAMPLE
      aws-secret-access-key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

databases:
  - ident: warehouse
    volume: s3tables_production

schemas:
  - database: warehouse
    schema: sales
  - database: warehouse
    schema: inventory

File Volume

volumes:
  - ident: local_dev
    type: file
    path: /tmp/embucket-data
    database: dev

databases:
  - ident: dev
    volume: local_dev

schemas:
  - database: dev
    schema: testing

Multiple Volumes and Pre-registered Tables

volumes:
  - ident: raw_data
    type: s3
    region: us-east-1
    bucket: raw-data-bucket
    credentials:
      credential_type: access_key
      aws-access-key-id: AKIAIOSFODNN7EXAMPLE
      aws-secret-access-key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

  - ident: processed_data
    type: s3
    region: us-east-1
    bucket: processed-data-bucket
    credentials:
      credential_type: access_key
      aws-access-key-id: AKIAIOSFODNN7EXAMPLE
      aws-secret-access-key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

databases:
  - ident: raw
    volume: raw_data
  - ident: processed
    volume: processed_data

schemas:
  - database: raw
    schema: public
  - database: processed
    schema: public

tables:
  - database: processed
    schema: public
    table: customer_metrics
    metadata_location: s3://processed-data-bucket/warehouse/customer_metrics/metadata/v3.metadata.json

Custom S3 Endpoint (MinIO)

volumes:
  - ident: minio_storage
    type: s3
    region: us-east-1
    bucket: embucket
    endpoint: http://localhost:9000
    credentials:
      credential_type: access_key
      aws-access-key-id: minioadmin
      aws-secret-access-key: minioadmin

databases:
  - ident: local
    volume: minio_storage

schemas:
  - database: local
    schema: public

Session Token Credentials

volumes:
  - ident: temp_bucket
    type: s3
    region: us-west-2
    bucket: temporary-data
    credentials:
      credential_type: access_key
      aws-access-key-id: ASIAIOSFODNN7EXAMPLE
      aws-secret-access-key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
      aws-session-token: FwoGZXIvYXdzEBYaDK...

databases:
  - ident: temp
    volume: temp_bucket

schemas:
  - database: temp
    schema: public

Validation Rules

  • Volume idents must be unique within the configuration
  • Bucket names must only contain alphanumeric characters, hyphens, or underscores
  • Bucket names must not start or end with a hyphen or underscore
  • AWS Access Key IDs must be 20-character alphanumeric strings
  • AWS Secret Access Keys must be 40-character Base64-like strings
  • S3 Tables ARNs must follow the format arn:aws:s3tables:region:account-id:bucket/bucket-name
  • Endpoints must start with http:// or https://
  • Database volumes must reference existing volume idents
  • Schema databases must reference existing database idents
  • Table metadata locations must be accessible from the volume’s object store
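
Several of the rules above are simple string checks. The following Python sketch shows one way to express them for an S3 volume entry; the regular expressions are an interpretation of the wording (in particular, "Base64-like" is read as the characters A-Z, a-z, 0-9, `+`, `/`, and `=`), and `validate_s3_volume` is a hypothetical helper rather than Embucket's validator.

```python
from __future__ import annotations

import re

# Alphanumerics, hyphens, underscores; no leading or trailing hyphen/underscore.
BUCKET_RE = re.compile(r"^[A-Za-z0-9](?:[A-Za-z0-9_-]*[A-Za-z0-9])?$")
ACCESS_KEY_ID_RE = re.compile(r"^[A-Za-z0-9]{20}$")
# Assumption: "Base64-like" means the standard Base64 alphabet.
SECRET_KEY_RE = re.compile(r"^[A-Za-z0-9+/=]{40}$")


def validate_s3_volume(volume: dict) -> list[str]:
    """Return a list of rule violations for one parsed S3 volume entry."""
    errors = []
    bucket = volume.get("bucket")
    if bucket is not None and not BUCKET_RE.match(bucket):
        errors.append(f"invalid bucket name: {bucket!r}")

    endpoint = volume.get("endpoint")
    if endpoint is not None and not endpoint.startswith(("http://", "https://")):
        errors.append(f"endpoint must start with http:// or https://: {endpoint!r}")

    creds = volume.get("credentials", {})
    if creds.get("credential_type") == "access_key":
        if not ACCESS_KEY_ID_RE.match(creds.get("aws-access-key-id", "")):
            errors.append("aws-access-key-id must be 20 alphanumeric characters")
        if not SECRET_KEY_RE.match(creds.get("aws-secret-access-key", "")):
            errors.append("aws-secret-access-key must be 40 Base64-like characters")
    return errors
```

The sketch applies the rules literally. It would, for instance, flag the `minioadmin` keys used in the MinIO example, so the real validator may treat custom endpoints differently.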

Bootstrap Behavior

  1. Volumes are created first in the order defined
  2. Databases are created and linked to their volumes
  3. Schemas are created within databases (a default public schema is auto-created)
  4. Tables are registered from their metadata locations
  5. If items already exist, they are skipped (idempotent)
  6. The should_refresh flag triggers metadata reload for S3 Tables volumes
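
The ordering and skip-if-exists behavior described above can be modeled in a few lines. This is a toy sketch against an in-memory dictionary, not Embucket's implementation: the `bootstrap` function and the `store` layout are invented for illustration, and the auto-created public schema and `should_refresh` handling are deliberately left out.

```python
from __future__ import annotations

# Processing order and the key that identifies each kind of item.
ORDER = (
    ("volumes", lambda v: v["ident"]),
    ("databases", lambda d: d["ident"]),
    ("schemas", lambda s: (s["database"], s["schema"])),
    ("tables", lambda t: (t["database"], t["schema"], t["table"])),
)


def bootstrap(config: dict, store: dict) -> list[tuple]:
    """Apply config to store in the documented order; return what was created."""
    created = []
    for kind, key_of in ORDER:
        existing = store.setdefault(kind, {})
        for item in config.get(kind, []):
            key = key_of(item)
            if key in existing:
                continue  # already present: skip, which keeps reruns idempotent
            existing[key] = item
            created.append((kind, key))
    return created
```

Running `bootstrap` twice with the same configuration and store creates items only on the first call; the second call returns an empty list.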