Overview
The metastore configuration file defines how Embucket connects to data storage and organizes catalogs. This file is passed to Embucket at startup and is organized into four sections:
- Volumes: Physical storage backends
- Databases: Logical catalog groupings
- Schemas: Namespaces within databases
- Tables: Individual table registrations
Configuration Schema
Top-Level Structure
Volumes Section
Volumes define the physical storage backends where data resides.
Common Fields
Unique identifier for this volume. Referenced by databases and tables.
Storage backend type. Valid values:
- s3: Amazon S3 or S3-compatible storage
- s3-tables: AWS S3 Table Buckets
- file: Local filesystem
- memory: In-memory storage (temporary)
Optional database name to create automatically for this volume. Shorthand for defining a database entry.
S3 Volume Type
For standard Amazon S3 or S3-compatible storage:
AWS region for the S3 bucket (e.g., us-east-1, eu-west-2).
S3 bucket name. Must contain only alphanumeric characters, hyphens, or underscores. Cannot start or end with a hyphen or underscore.
Custom S3 endpoint URL. Required for S3-compatible storage (MinIO, Ceph, etc.). Must start with http:// or https://.
AWS credentials for bucket access. See Credential Types below.
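The bucket and endpoint rules above can be checked before starting Embucket. The following is a minimal Python sketch, not Embucket's own validator: the function names are hypothetical, and it assumes "alphanumeric" means ASCII letters and digits.

```python
import re

# Assumption: "alphanumeric" means ASCII letters and digits.
# The name may contain hyphens and underscores, but must start and end
# with a letter or digit.
_BUCKET_RE = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9_-]*[A-Za-z0-9])?")


def is_valid_bucket(name: str) -> bool:
    """Check a bucket name against the rules stated in this section."""
    return _BUCKET_RE.fullmatch(name) is not None


def is_valid_endpoint(url: str) -> bool:
    """Check that a custom endpoint starts with http:// or https://."""
    return url.startswith(("http://", "https://"))
```

For example, `is_valid_bucket("my-data_bucket1")` passes, while `is_valid_bucket("-bad")` and `is_valid_endpoint("localhost:9000")` fail.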
S3 Tables Volume Type
For AWS S3 Table Buckets:
Amazon Resource Name (ARN) of the S3 Table Bucket. Format: arn:aws:s3tables:region:account:bucket/name. The ARN must start with arn:aws:s3tables: and include a valid region, account ID, and bucket name.
Custom S3 Tables endpoint. Only needed for testing or non-standard AWS configurations.
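To illustrate the ARN format, here is a small Python sketch that splits an S3 Table Bucket ARN into its parts. The function name is hypothetical. The character sets allowed for the region and bucket name are assumptions, since this page does not spell them out; the 12-digit account ID follows the standard AWS account ID format.

```python
import re
from typing import Dict, Optional

# Format from this page: arn:aws:s3tables:region:account:bucket/name
# Assumptions: region is lowercase letters, digits, and hyphens; the bucket
# name is letters, digits, hyphens, and underscores.
_ARN_RE = re.compile(
    r"arn:aws:s3tables:"
    r"(?P<region>[a-z0-9-]+):"
    r"(?P<account>\d{12}):"
    r"bucket/(?P<bucket>[A-Za-z0-9_-]+)"
)


def parse_s3_tables_arn(arn: str) -> Optional[Dict[str, str]]:
    """Return the region, account, and bucket parts, or None if malformed."""
    match = _ARN_RE.fullmatch(arn)
    return match.groupdict() if match else None
```

A plain S3 ARN such as `arn:aws:s3:::my-bucket` does not match and returns `None`.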
File Volume Type
For local filesystem storage:
Absolute path to the directory where data will be stored. The directory will be created if it doesn’t exist.
Memory Volume Type
For temporary in-memory storage.
Credential Types
Credentials authenticate Embucket to cloud storage backends.
Access Key Credentials
Most common authentication method:
Must be access_key for access key authentication.
AWS access key ID. Must be exactly 20 alphanumeric characters.
AWS secret access key. Must be exactly 40 Base64-like characters (uppercase, lowercase, digits, +/=).
Optional session token for temporary credentials from AWS STS or IAM roles.
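The length and character rules for access key credentials can be checked before deployment. This Python sketch is illustrative only; the function name is hypothetical, and it assumes ASCII characters. The sample values in the usage note are the placeholder keys from AWS's public documentation, not real credentials.

```python
import re
from typing import List

# Rules from this page: key ID is exactly 20 alphanumeric characters,
# secret is exactly 40 characters drawn from A-Z, a-z, 0-9, +, /, =.
_ACCESS_KEY_ID_RE = re.compile(r"[A-Za-z0-9]{20}")
_SECRET_KEY_RE = re.compile(r"[A-Za-z0-9+/=]{40}")


def check_access_key(access_key_id: str, secret_access_key: str) -> List[str]:
    """Return a list of format problems; an empty list means both look valid."""
    errors = []
    if _ACCESS_KEY_ID_RE.fullmatch(access_key_id) is None:
        errors.append("access key ID must be exactly 20 alphanumeric characters")
    if _SECRET_KEY_RE.fullmatch(secret_access_key) is None:
        errors.append("secret access key must be exactly 40 Base64-like characters")
    return errors
```

Calling `check_access_key("AKIAIOSFODNN7EXAMPLE", "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY")` returns an empty list. A check like this confirms only the format, not that the credentials are accepted by AWS.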
Token Credentials
For OAuth or other token-based authentication:
Token authentication is less common. Most users should use access key credentials.
Databases Section
Databases are logical groupings of schemas and tables, associated with a specific volume.
Unique database identifier. This is the database name used in SQL queries.
Volume identifier this database is associated with. Must match an ident from the volumes section.
If true, Embucket periodically refreshes the catalog metadata to discover new tables or schema changes. Useful for S3 Tables catalogs.
Schemas Section
Schemas provide namespaces within databases for organizing tables.
Database this schema belongs to. Must match a database ident.
Schema name. Used in SQL queries.
A public schema is automatically created for each database if not explicitly defined.
Tables Section
Explicitly register external Iceberg tables by pointing to their metadata files.
Database the table belongs to. Must match a database ident.
Schema the table belongs to. The schema must be defined in the schemas section.
Table name as it will appear in queries.
Full S3 URI to the Apache Iceberg metadata JSON file. This file must:
- Be in the same bucket as the database’s volume
- Be accessible with the volume’s credentials
- Contain valid Iceberg metadata
Example location: s3://BUCKET_NAME/path/to/table/metadata/00000-abc123.metadata.json
Environment Variables
You can reference environment variables in the configuration using ${VAR_NAME} syntax.
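To preview what a configuration looks like once placeholders are resolved, a substitution pass can be sketched in Python as follows. This is not Embucket's implementation: the function name is hypothetical, and raising an error for an unset variable is an assumption, since this page does not say how Embucket treats missing variables.

```python
import os
import re

# Matches ${VAR_NAME} where VAR_NAME is a conventional shell-style identifier.
_VAR_RE = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(text, env=None):
    """Replace each ${VAR_NAME} in text with its value from env.

    Assumption: an unset variable is treated as an error here.
    """
    env = os.environ if env is None else env

    def replace(match):
        name = match.group(1)
        if name not in env:
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _VAR_RE.sub(replace, text)
```

For instance, `expand_env("bucket: ${DATA_BUCKET}", {"DATA_BUCKET": "my-bucket"})` returns `"bucket: my-bucket"`. Passing an explicit mapping keeps the check independent of the current shell environment.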
Performance Tuning
Configure object store timeouts via environment variables:
| Environment Variable | Default (seconds) | Description |
|---|---|---|
| OBJECT_STORE_TIMEOUT_SECS | 30 | Overall operation timeout |
| OBJECT_STORE_CONNECT_TIMEOUT_SECS | 3 | Connection timeout |
| AWS_SDK_CONNECT_TIMEOUT_SECS | 3 | AWS SDK connection timeout |
| AWS_SDK_OPERATION_TIMEOUT_SECS | 30 | AWS SDK operation timeout |
| AWS_SDK_OPERATION_ATTEMPT_TIMEOUT_SECS | 10 | AWS SDK retry attempt timeout |
| ICEBERG_CREATE_TABLE_TIMEOUT_SECS | 30 | Table creation timeout |
| ICEBERG_CATALOG_TIMEOUT_SECS | 10 | Catalog operation timeout |
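When tuning these values, it can help to see which timeouts a given environment would produce. The Python sketch below mirrors the defaults from the table; the helper name is hypothetical, and it assumes each variable holds a whole number of seconds.

```python
import os

# Defaults copied from the table above (seconds).
DEFAULT_TIMEOUTS_SECS = {
    "OBJECT_STORE_TIMEOUT_SECS": 30,
    "OBJECT_STORE_CONNECT_TIMEOUT_SECS": 3,
    "AWS_SDK_CONNECT_TIMEOUT_SECS": 3,
    "AWS_SDK_OPERATION_TIMEOUT_SECS": 30,
    "AWS_SDK_OPERATION_ATTEMPT_TIMEOUT_SECS": 10,
    "ICEBERG_CREATE_TABLE_TIMEOUT_SECS": 30,
    "ICEBERG_CATALOG_TIMEOUT_SECS": 10,
}


def effective_timeouts(env=None):
    """Return each timeout, taking an override from env when one is set.

    Assumption: overrides are integer seconds; int() raises ValueError
    for anything else.
    """
    env = os.environ if env is None else env
    result = {}
    for name, default in DEFAULT_TIMEOUTS_SECS.items():
        raw = env.get(name)
        result[name] = int(raw) if raw else default
    return result
```

With `{"OBJECT_STORE_TIMEOUT_SECS": "60"}` as the environment, only that entry changes to 60 and the rest keep their defaults.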
Complete Examples
Multi-Catalog Setup
Mixed Storage Types
External Iceberg with Multiple Tables
Best Practices
Security
- Never commit credentials: Use environment variables or IAM roles
- Principle of least privilege: Grant only required S3 permissions
- Rotate credentials: Regularly update access keys
- Use session tokens: Leverage temporary credentials when possible
Organization
- Consistent naming: Use clear, descriptive identifiers for volumes and databases
- Schema separation: Use schemas to separate environments (prod, staging, dev)
- Logical grouping: Group related tables in the same schema
- Volume per environment: Separate production and non-production storage
Performance
- Co-locate data: Keep related tables in the same bucket to minimize latency
- Tune timeouts: Adjust timeout settings based on your network conditions
- Minimize metadata: Only register tables you actively query
- Use S3 Tables for discovery: Enable should_refresh for dynamic catalogs
Maintainability
- Document your config: Add comments explaining non-obvious settings
- Version control: Track configuration changes in git
- Environment-specific configs: Maintain separate files per environment
- Validate before deploy: Test configuration changes in staging first
Validation Rules
Volume Validation
- ident must be non-empty
- S3 bucket must be alphanumeric with hyphens/underscores only
- S3 bucket cannot start or end with hyphens/underscores
- endpoint must start with http:// or https://
- S3 Tables arn must match format: arn:aws:s3tables:region:account:bucket/name
- Access key ID must be exactly 20 alphanumeric characters
- Secret access key must be exactly 40 Base64 characters
Database Validation
- ident must be non-empty
- volume must reference an existing volume identifier
Schema Validation
- database must reference an existing database identifier
- schema must be non-empty
Table Validation
- database must reference an existing database
- schema must reference an existing schema
- table must be non-empty
- metadata_location must be a valid S3 URI
- Metadata file must exist and be accessible
- Metadata file must be in the same bucket as the database’s volume
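The cross-reference rules above can be sketched as a single pass over an already-parsed configuration. This Python example is an illustration under several assumptions: the top-level keys `volumes`, `databases`, `schemas`, and `tables` are inferred from the section names on this page, the function name is hypothetical, the automatically created public schema is not modelled, and no check is made that the metadata file actually exists.

```python
from urllib.parse import urlparse


def check_references(config):
    """Return a list of reference errors; an empty list means none found."""
    errors = []
    volumes = {v["ident"]: v for v in config.get("volumes", [])}

    # Each database must point at a defined volume.
    databases = {}
    for db in config.get("databases", []):
        databases[db["ident"]] = db
        if db["volume"] not in volumes:
            errors.append(f"database {db['ident']}: unknown volume {db['volume']}")

    # Each schema must point at a defined database.
    schemas = set()
    for sch in config.get("schemas", []):
        schemas.add((sch["database"], sch["schema"]))
        if sch["database"] not in databases:
            errors.append(f"schema {sch['schema']}: unknown database {sch['database']}")

    # Each table must point at a defined database and schema, and its
    # metadata must live in the bucket of the database's volume.
    for tbl in config.get("tables", []):
        name = tbl["table"]
        db = databases.get(tbl["database"])
        if db is None:
            errors.append(f"table {name}: unknown database {tbl['database']}")
            continue
        if (tbl["database"], tbl["schema"]) not in schemas:
            errors.append(f"table {name}: unknown schema {tbl['schema']}")
        uri = urlparse(tbl["metadata_location"])
        volume = volumes.get(db["volume"])
        if uri.scheme != "s3" or not uri.netloc:
            errors.append(f"table {name}: metadata_location is not an S3 URI")
        elif volume is not None and uri.netloc != volume.get("bucket"):
            errors.append(f"table {name}: metadata is not in the volume's bucket")
    return errors
```

Collecting every problem in one list, rather than stopping at the first, makes it easier to fix a configuration in a single edit.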
Troubleshooting
Configuration Parse Errors
Symptom: Failed to parse metastore config
Solutions:
- Validate YAML syntax (indentation, colons, dashes)
- Check for typos in field names
- Ensure required fields are present
- Verify credential format matches exactly
Volume Connection Failures
Symptom: Failed to validate credentials or ObjectStore error
Solutions:
- Verify credentials have correct permissions
- Check network connectivity to S3/storage endpoint
- Validate bucket/ARN exists and is spelled correctly
- Ensure region matches bucket location
- Test credentials with AWS CLI: aws s3 ls s3://bucket-name
Table Registration Errors
Symptom: Invalid metadata location or Metadata parse error
Solutions:
- Verify metadata file exists at specified location
- Check file is valid Iceberg metadata JSON
- Ensure table is in same bucket as volume
- Confirm credentials have read access
- Validate metadata file isn’t corrupted
Reference Errors
Symptom: Volume not found or Database not found
Solutions:
- Check identifiers match exactly (case-sensitive)
- Ensure volumes are defined before databases
- Verify databases are defined before schemas
- Confirm schemas exist before registering tables
Next Steps
- S3 Tables Setup: Configure AWS S3 Table Buckets
- External Iceberg: Register existing Iceberg tables
- Deploy Embucket: Production deployment guide
- Write Queries: Learn SQL syntax