Embucket is a single-binary lakehouse that provides a wire-compatible Snowflake replacement built on proven open source technologies. This page explains how the system components work together to execute queries on your data lake.
System Overview
Embucket follows a query-per-node architecture where each instance handles complete queries independently, from API request to result delivery. This design enables horizontal scaling by simply adding more nodes for increased throughput.
Query Engine
Apache DataFusion powers SQL execution with query planning and optimization
Storage Layer
Apache Iceberg provides ACID transactions and time travel on your data lake
API Layer
Snowflake-compatible REST API for seamless client integration
Catalog
Metastore manages database, schema, and table metadata
Core Components
API Layer (api-snowflake-rest)
The API layer implements the Snowflake SQL REST API v1, enabling compatibility with existing Snowflake clients, SDKs, and tools.
Key endpoints:
- /session/v1/login-request - Authentication and session creation
- /queries/v1/query-request - Query submission and execution
- /queries/v1/abort-request - Query cancellation
- /session - Session management
Key features:
- JWT-based authentication with configurable token expiration (default: 3 days)
- Session state tracking with automatic expiration
- Request compression and decompression support
- Snowflake-compatible response formatting (JSON)
See crates/api-snowflake-rest/src/server/router.rs:24 for the router implementation.
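As a sketch of what a client exchange looks like, the request bodies for the login and query endpoints can be built like this. The field names follow Snowflake's v1 REST conventions (`LOGIN_NAME`, `PASSWORD`, `sqlText`) but are illustrative, not a complete specification:

```python
import json

# Hypothetical payload builders for the two main endpoints; the exact set
# of accepted fields is defined by the Snowflake SQL REST API v1.
def login_request(user: str, password: str) -> str:
    # POST /session/v1/login-request — returns a session token (JWT)
    return json.dumps({"data": {"LOGIN_NAME": user, "PASSWORD": password}})

def query_request(sql: str) -> str:
    # POST /queries/v1/query-request — authenticated with the token from login
    return json.dumps({"sqlText": sql})

body = query_request("SELECT 1")
```

Because the API is wire-compatible, existing Snowflake drivers that speak this protocol can target an Embucket node without code changes.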
Execution Service (executor)
The execution service (CoreExecutionService) orchestrates query execution using Apache DataFusion as the underlying query engine.
Core responsibilities:
- Session lifecycle management with expiration tracking
- Query submission, execution, and result delivery
- Concurrent query management with configurable limits (default: 100)
- Query timeout enforcement (default: 20 minutes)
- Memory and disk resource pool management
Query lifecycle:
- Submit: Query accepted and assigned a unique ID
- Parse: SQL parsed using Snowflake dialect extensions
- Plan: DataFusion creates logical plan from AST
- Optimize: Query optimizer applies rules and rewrites
- Execute: Physical plan executed with streaming results
- Return: Results formatted and returned to client
See crates/executor/src/service.rs:586 for the query execution implementation.
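The six lifecycle stages above can be sketched as an ordered pipeline. This is a simplification (the real engine streams results and can fail or be cancelled at any stage), but it captures the ordering:

```python
from enum import Enum

class QueryState(Enum):
    SUBMIT = 1    # query accepted, unique ID assigned
    PARSE = 2     # SQL parsed with Snowflake dialect extensions
    PLAN = 3     # logical plan built from the AST
    OPTIMIZE = 4  # optimizer rules and rewrites applied
    EXECUTE = 5   # physical plan run with streaming results
    RETURN = 6    # results formatted and returned to the client

def advance(state: QueryState) -> QueryState:
    # Move to the next lifecycle stage; RETURN is terminal.
    if state is QueryState.RETURN:
        return state
    return QueryState(state.value + 1)

state = QueryState.SUBMIT
while state is not QueryState.RETURN:
    state = advance(state)
```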
Catalog System (catalog)
The catalog system bridges Embucket’s metastore and DataFusion’s catalog interface, enabling query engine access to metadata.
Catalog hierarchy:
- Embucket: Native catalogs backed by Iceberg tables
- S3 Tables: AWS S3 Table Buckets integration
- Memory: In-memory catalogs for temporary data
The CachingTable wrapper normalizes schema case sensitivity and rewrites queries for Snowflake SQL compatibility (see crates/catalog/src/table.rs:24).
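Case normalization matters because Snowflake treats unquoted identifiers as case-insensitive (folded to upper case) while quoted identifiers preserve case exactly. A minimal sketch of that rule (the actual logic lives in crates/catalog/src/table.rs):

```python
def normalize_ident(ident: str) -> str:
    # Snowflake semantics: unquoted identifiers fold to upper case;
    # double-quoted identifiers keep their exact spelling.
    if len(ident) >= 2 and ident.startswith('"') and ident.endswith('"'):
        return ident[1:-1]
    return ident.upper()
```

So `my_table`, `MY_TABLE`, and `My_Table` all resolve to the same table, while `"my_table"` refers only to a lower-case name.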
Metastore (catalog-metastore)
The metastore manages persistent metadata about databases, schemas, tables, and volumes.
Key concepts:
- Volume: Storage location configuration (S3, S3 Tables, local file, memory)
- Database: Top-level catalog associated with a volume
- Schema: Logical grouping of tables within a database
- Table: Apache Iceberg table with metadata location pointer
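The four concepts above form a containment hierarchy that can be modeled as follows. Field names here are illustrative assumptions, not Embucket's actual Rust types:

```python
from dataclasses import dataclass, field

@dataclass
class Volume:
    name: str
    kind: str       # e.g. "s3", "s3-tables", "file", or "memory"
    location: str   # base storage location

@dataclass
class Table:
    name: str
    metadata_location: str  # pointer to the current Iceberg metadata file

@dataclass
class Schema:
    name: str
    tables: dict = field(default_factory=dict)

@dataclass
class Database:
    name: str
    volume: Volume  # each database is associated with one volume
    schemas: dict = field(default_factory=dict)

vol = Volume("prod", "s3", "s3://bucket/warehouse")
db = Database("analytics", vol)
db.schemas["public"] = Schema("public")
db.schemas["public"].tables["events"] = Table(
    "events", "s3://bucket/warehouse/events/metadata/v3.metadata.json")
```

Note that the metastore stores only the pointer to the Iceberg metadata file; the table data itself stays on the volume's object storage.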
Query-Per-Node Architecture
Each Embucket instance operates independently without shared state beyond the metastore configuration. This stateless design yields several benefits:
- Simple horizontal scaling - add nodes behind a load balancer
- No coordination overhead between instances
- Isolated failure domains - one node failure doesn’t affect others
- Easy deployment - single binary with no dependencies
Per-node state:
- DataFusion session contexts with user session state
- Running query registry for cancellation and monitoring
- Memory and disk resource pools for query execution
- Iceberg table metadata cache
See crates/executor/src/service.rs:149 for the CoreExecutionService implementation.
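Because each node enforces its own concurrency limit, admission control can be sketched with a simple non-blocking semaphore. The `QueryGate` type is hypothetical (not the executor's actual registry), and it assumes a node at its limit rejects new work rather than queuing it:

```python
import threading

class QueryGate:
    """Admit at most `limit` concurrent queries on this node (default 100)."""

    def __init__(self, limit: int = 100):
        self._sem = threading.BoundedSemaphore(limit)

    def try_admit(self) -> bool:
        # Non-blocking acquire: returns False when the node is saturated.
        return self._sem.acquire(blocking=False)

    def release(self) -> None:
        # Called when a query completes, is cancelled, or times out.
        self._sem.release()

gate = QueryGate(limit=2)
admitted = [gate.try_admit() for _ in range(3)]  # third attempt fails
```

With no cross-node coordination, a load balancer can simply retry a rejected query on another node.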
Data Flow
End-to-end query execution follows this path.
Key steps:
- API Request: Client sends query with JWT token
- Authentication: Token validated and session metadata extracted
- Query Submission: Execution service spawns async task for query
- Planning: DataFusion parses SQL and creates logical plan
- Catalog Resolution: Table references resolved via catalog system
- Metadata Loading: Iceberg table metadata loaded from object storage
- Optimization: DataFusion applies query optimization rules
- Execution: Physical plan reads data files and computes results
- Result Formatting: Arrow RecordBatches converted to Snowflake JSON format
- Response: Results streamed back to client
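The result-formatting step can be illustrated in miniature: columnar batches are turned into a Snowflake-style JSON result where every cell is serialized as a string and a separate `rowtype` section carries column metadata. The field names follow Snowflake's result format, but this sketch stands in for the real Arrow-based conversion:

```python
def to_rowset(columns: dict) -> dict:
    # `columns` maps column name -> list of values (a stand-in for an
    # Arrow RecordBatch). Snowflake's JSON format transmits each value
    # as a string; NULLs stay as JSON null.
    names = list(columns)
    n_rows = len(next(iter(columns.values()), []))
    rowset = [
        [None if columns[c][i] is None else str(columns[c][i]) for c in names]
        for i in range(n_rows)
    ]
    return {"rowtype": [{"name": c} for c in names], "rowset": rowset}

batch = {"id": [1, 2], "city": ["Oslo", None]}
result = to_rowset(batch)
```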
Built on Apache DataFusion
Embucket uses Apache DataFusion as its query engine, providing:
- Vectorized execution: Columnar processing using Apache Arrow
- Query optimization: Cost-based optimizer with 100+ optimization rules
- Streaming execution: Memory-efficient result streaming
- Extensibility: Custom functions, operators, and plan nodes
Embucket extensions:
- Custom SQL parser with Snowflake dialect extensions
- Session state management for user context (database, schema, parameters)
- Physical plan optimization with runtime statistics
- Memory pool configuration for controlled resource usage
See crates/executor/src/service.rs:232 for RuntimeEnv configuration.
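The benefit of vectorized execution can be shown in miniature: a filter-and-sum that scans tightly packed column arrays instead of visiting every field of every row. DataFusion does this over Arrow arrays in Rust; plain Python lists stand in here:

```python
# Row-at-a-time: each step touches whole records.
rows = [{"amount": a, "region": r}
        for a, r in [(10, "eu"), (25, "us"), (7, "eu")]]
row_total = sum(row["amount"] for row in rows if row["region"] == "eu")

# Columnar: each column is its own contiguous array. The filter produces
# a boolean mask over one vector, and the sum scans another - the access
# pattern a vectorized engine exploits for cache efficiency and SIMD.
amount = [10, 25, 7]
region = ["eu", "us", "eu"]
mask = [r == "eu" for r in region]
col_total = sum(a for a, keep in zip(amount, mask) if keep)
```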
Resource Management
Embucket provides fine-grained control over resource usage.
Memory pools:
- Fair: Distributes memory evenly across concurrent queries
- Greedy: Allows queries to use maximum available memory
- Optional consumer tracking for monitoring (top 5 consumers)
Disk spilling:
- Configurable disk pool size for sort/join operations
- Uses OS temporary directory by default
- Automatic cleanup on query completion
Query limits:
- Maximum concurrent queries (default: 100)
- Query timeout enforcement (default: 20 minutes)
- Automatic query cancellation on timeout
- Graceful cleanup on abort
See crates/embucketd/src/cli.rs for configuration options.
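The fair/greedy distinction can be sketched with two toy allocators, modeled loosely on DataFusion's GreedyMemoryPool and FairSpillPool. The exact accounting here is an assumption: greedy grants any request that fits the remaining budget, while fair caps each query at its share of the budget:

```python
class GreedyPool:
    # One hungry query may consume nearly the whole budget.
    def __init__(self, budget: int):
        self.budget, self.used = budget, 0

    def try_grow(self, query: str, nbytes: int) -> bool:
        if self.used + nbytes > self.budget:
            return False  # would exceed the pool; caller must spill or fail
        self.used += nbytes
        return True

class FairPool:
    # Each query is capped at budget / n_queries.
    def __init__(self, budget: int, n_queries: int):
        self.per_query = budget // n_queries
        self.used: dict = {}

    def try_grow(self, query: str, nbytes: int) -> bool:
        if self.used.get(query, 0) + nbytes > self.per_query:
            return False  # query hit its fair share
        self.used[query] = self.used.get(query, 0) + nbytes
        return True

greedy = GreedyPool(budget=100)
fair = FairPool(budget=100, n_queries=2)
```

Greedy maximizes throughput for a single large query; fair prevents one query from starving its neighbors.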
Next Steps
Snowflake Compatibility
Learn what Snowflake features are supported
Iceberg Integration
Understand how Embucket uses Apache Iceberg