What are Catalogs?
Catalogs in Embucket organize and manage metadata about your data lake tables. They provide a hierarchical namespace for organizing databases, schemas, and tables, similar to how traditional databases structure their metadata. A catalog serves as the root of your data hierarchy.
Why Catalogs Matter
Catalogs are essential for:
- Organization: Structure your data lake with a familiar database/schema/table hierarchy
- Metadata Management: Track table locations, schemas, and properties in a centralized way
- Access Control: Define permissions and access patterns at different levels
- Multi-Tenancy: Isolate data for different teams or projects using separate catalogs
- Query Routing: Direct queries to the appropriate storage backends automatically
The Hierarchy Model
Embucket uses a four-level hierarchy to organize data:
Volume
The physical storage backend where data resides. A volume can be S3, S3 Tables, local filesystem, or in-memory storage. Volumes are the foundation that catalogs build upon.
Database
A logical grouping of schemas within a volume. Each database is associated with exactly one volume and represents a catalog in Snowflake terms.
Schema
A namespace within a database that contains tables. Schemas provide an additional level of organization and are commonly used to separate different datasets or environments (e.g., production, staging, public).
Table
The actual data structure containing rows and columns. Tables in Embucket use Apache Iceberg format for ACID transactions and time travel capabilities.
Supported Catalog Types
Embucket supports multiple catalog types to fit different use cases:
S3 Tables
Native AWS S3 Table Buckets with automatic metadata management
External Iceberg
Connect to existing Iceberg tables on S3 or other storage
In-Memory
Temporary storage for development and testing
File System
Local filesystem storage for single-node deployments
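To make the hierarchy and its storage backends more concrete, here is a minimal Python sketch of the volume, database, schema, and table relationship. It is a toy model for illustration only, not Embucket's implementation: the class names, the `VolumeKind` string values, and the `resolve` helper are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from enum import Enum


class VolumeKind(Enum):
    # The four storage backends named on this page; the string values are hypothetical.
    S3 = "s3"
    S3_TABLES = "s3-tables"
    FILE = "file"
    MEMORY = "memory"


@dataclass
class Volume:
    name: str
    kind: VolumeKind


@dataclass
class Schema:
    name: str
    tables: set[str] = field(default_factory=set)


@dataclass
class Database:
    name: str
    volume: Volume  # each database is associated with exactly one volume
    schemas: dict[str, Schema] = field(default_factory=dict)


class Catalog:
    """Toy in-memory registry of databases (illustrative only)."""

    def __init__(self) -> None:
        self.databases: dict[str, Database] = {}

    def add_table(self, database: Database, schema: str, table: str) -> None:
        # Register the database on first use, then create the schema if needed.
        db = self.databases.setdefault(database.name, database)
        db.schemas.setdefault(schema, Schema(schema)).tables.add(table)

    def resolve(self, qualified_name: str) -> Volume:
        """Map 'database.schema.table' to the volume that stores the table."""
        db_name, schema_name, table_name = qualified_name.split(".")
        db = self.databases[db_name]
        if table_name not in db.schemas[schema_name].tables:
            raise KeyError(qualified_name)
        return db.volume


catalog = Catalog()
analytics = Database("analytics", Volume("lake", VolumeKind.S3))
catalog.add_table(analytics, "public", "events")
print(catalog.resolve("analytics.public.events").kind)  # VolumeKind.S3
```

Resolving a fully qualified name down to a volume is a simplified picture of the query routing role described earlier: the name alone is enough to find the storage backend.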
Configuring Catalogs
Catalogs are defined in a YAML configuration file passed to embucketd at startup. See Metastore Configuration for the complete YAML schema reference.
Choosing the Right Catalog Type
Select your catalog type based on your requirements:
Use S3 Tables When:
- You want AWS-managed metadata and automatic indexing
- You’re building a new data lake on AWS
- You need tight integration with AWS services
- You want AWS to handle metadata availability and scaling
Use External Iceberg When:
- You have existing Iceberg tables from other tools (Spark, Trino, etc.)
- You need maximum flexibility in table organization
- You want to manage metadata files yourself
- You need to support multiple query engines
Use In-Memory When:
- Running tests or development experiments
- Creating temporary tables for query processing
- Prototyping without persistence requirements
Use File System When:
- Deploying on a single node without cloud storage
- Local development and testing
- Low-latency access on the same machine
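The guidance above can be condensed into a small decision helper. The sketch below is one possible reading, not an official selection rule: the `Requirements` fields and the order in which the checks run are assumptions made for illustration, and real deployments may weigh these factors differently.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    # Illustrative fields that paraphrase the bullets above (hypothetical names).
    needs_persistence: bool = True
    existing_iceberg_tables: bool = False
    multiple_query_engines: bool = False
    on_aws: bool = False


def suggest_catalog_type(req: Requirements) -> str:
    """Return one of the four catalog types named on this page."""
    if not req.needs_persistence:
        # Tests, prototypes, and temporary tables.
        return "In-Memory"
    if req.existing_iceberg_tables or req.multiple_query_engines:
        # Tables already written by other tools, or shared across engines.
        return "External Iceberg"
    if req.on_aws:
        # New data lake on AWS with AWS-managed metadata.
        return "S3 Tables"
    # Single-node deployment without cloud storage.
    return "File System"


print(suggest_catalog_type(Requirements(on_aws=True)))  # S3 Tables
```

Checking persistence first reflects that In-Memory is only appropriate when data loss is acceptable; the remaining checks then separate existing Iceberg deployments from new AWS-based ones.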
Next Steps
Configure S3 Tables
Set up AWS S3 Table Buckets catalog
Connect External Iceberg
Register existing Iceberg tables
Metastore Configuration
Complete YAML schema reference