Run the Snowflake SQL dialect on your data lake in 30 seconds, with zero dependencies. Embucket is a single-binary lakehouse that provides a wire-compatible Snowflake replacement and stores data in the Apache Iceberg open table format. It suits simple deployments that still want the power of proven open source technologies.

What is Embucket?

Embucket is an engine for building and managing an open lakehouse platform. It combines the simplicity of a single-binary deployment with query processing built on Apache DataFusion, offering:
  • Wire-compatible Snowflake API: Use your existing Snowflake queries, dbt projects, and BI tools without modification
  • Apache Iceberg storage: Your data stays in the Apache Iceberg open table format on object storage, with no vendor lock-in
  • Radical simplicity: Single binary deployment with zero external dependencies
  • Query-per-node architecture: Each instance handles complete queries independently for predictable performance
  • Horizontal scaling: Add nodes for more throughput without complex orchestration

Built on Proven Open Source

Embucket leverages battle-tested Apache projects:

Apache DataFusion

High-performance SQL execution engine with advanced query optimization

Apache Iceberg

ACID-compliant table format with time travel and schema evolution

Key Features

Snowflake SQL Dialect

Run your existing Snowflake SQL queries without modification. Embucket implements Snowflake’s SQL dialect, including:
  • Date and time functions (DATEADD, DATEDIFF, CURRENT_TIMESTAMP)
  • String manipulation functions
  • Aggregate and window functions
  • CTEs and complex subqueries
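
As a rough sketch of what this looks like from a client, the Python snippet below combines a CTE, a window function, and the date functions listed above in one query and sends it through the Snowflake Python connector (`snowflake-connector-python`). The host, port, credentials, and the `orders` table are placeholders assumed for illustration, not documented Embucket defaults; substitute the values for your own instance.

```python
# Sketch only: connection values and the `orders` table are hypothetical.
QUERY = """
WITH recent_orders AS (
    SELECT customer_id, order_ts, amount
    FROM orders
    WHERE order_ts >= DATEADD(day, -30, CURRENT_TIMESTAMP)
)
SELECT
    customer_id,
    DATEDIFF(day, order_ts, CURRENT_TIMESTAMP) AS days_ago,
    SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_ts) AS running_total
FROM recent_orders
"""


def connection_params(host="localhost", port=3000):
    """Build keyword arguments for snowflake.connector.connect().

    All values are placeholders; replace them with your instance's settings.
    """
    return {
        "host": host,
        "port": port,
        "protocol": "http",      # assumption: plain HTTP on a local instance
        "account": "embucket",   # placeholder
        "user": "embucket",      # placeholder
        "password": "embucket",  # placeholder
    }


def run_query(sql=QUERY, **overrides):
    """Execute `sql` and return all rows as a list of tuples."""
    # Imported lazily so this module loads even without the connector installed.
    import snowflake.connector

    conn = snowflake.connector.connect(**{**connection_params(), **overrides})
    try:
        cur = conn.cursor()
        cur.execute(sql)
        return cur.fetchall()
    finally:
        conn.close()
```

Calling `run_query()` needs a reachable instance and the connector installed; `connection_params()` on its own only builds the settings dictionary.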

Apache Iceberg Native

Your data remains in the open Apache Iceberg format:
  • No lock-in: Access your data with any Iceberg-compatible tool
  • ACID transactions: Consistent reads and writes across concurrent operations
  • Time travel: Query historical versions of your data
  • Schema evolution: Add, drop, rename, or reorder columns without rewriting data files
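
To make time travel concrete: an Iceberg table keeps a log of snapshots, and a query "as of" some point in time reads the latest snapshot committed at or before that time. The Python sketch below is a toy model of that lookup. The `Snapshot` class and `snapshot_as_of` function are hypothetical names and do not represent Embucket's or Iceberg's actual implementation.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    committed_at_ms: int   # commit time, milliseconds since the epoch
    data_files: tuple      # files visible in this table version


def snapshot_as_of(log, timestamp_ms):
    """Return the latest snapshot committed at or before timestamp_ms.

    `log` must be sorted by committed_at_ms. Returns None when the table
    had no snapshot yet at that time.
    """
    times = [s.committed_at_ms for s in log]
    idx = bisect_right(times, timestamp_ms)
    return log[idx - 1] if idx else None


# Each commit adds a snapshot; older snapshots stay readable.
log = [
    Snapshot(1, 1_000, ("a.parquet",)),
    Snapshot(2, 2_000, ("a.parquet", "b.parquet")),
    Snapshot(3, 3_000, ("b.parquet", "c.parquet")),
]

current = log[-1]
historical = snapshot_as_of(log, 2_500)
print(current.snapshot_id, historical.snapshot_id)
```

Here the lookup at 2,500 ms resolves to snapshot 2, which still lists `a.parquet` even though the current snapshot no longer does.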

Simple Deployment

Embucket is designed for operational simplicity:
  • Single binary: No complex dependencies or runtime requirements
  • Multiple deployment modes: Run on bare metal, Docker, Kubernetes, or AWS Lambda
  • Minimal configuration: Start with defaults, configure only what you need
  • Self-contained: Embedded metastore for quick starts, external catalogs for production

Flexible Catalog Support

Connect to your data wherever it lives:
  • AWS S3 Table Buckets: Native integration with AWS S3 Tables catalog
  • External Iceberg tables: Point to existing Iceberg tables in S3, GCS, or Azure
  • REST catalog: Standard Iceberg REST catalog protocol support
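
Because the tables sit behind a standard Iceberg catalog, other Iceberg clients can read them too. As one hedged illustration, the sketch below uses PyIceberg's `load_catalog` to open a REST catalog and load a table. The catalog name `lake`, the URI, the warehouse, and the table identifier are hypothetical placeholders, and nothing here is specific to Embucket.

```python
def rest_catalog_properties(uri, warehouse=None):
    """Build PyIceberg properties for an Iceberg REST catalog."""
    props = {"type": "rest", "uri": uri}
    if warehouse is not None:
        props["warehouse"] = warehouse
    return props


def load_table(uri, identifier, warehouse=None):
    """Open the REST catalog at `uri` and return the named table."""
    # Imported lazily so the helper above works without PyIceberg installed.
    from pyiceberg.catalog import load_catalog

    catalog = load_catalog("lake", **rest_catalog_properties(uri, warehouse))
    return catalog.load_table(identifier)


# Example call (hypothetical endpoint and table; needs a reachable catalog):
#   table = load_table("http://localhost:8181", "analytics.orders")
#   rows = table.scan().to_arrow()
```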

Architecture Overview

Embucket follows a simple, scalable architecture:
┌─────────────────────────────────────────────────────────┐
│  Client Layer (Snowflake CLI, dbt, BI Tools)           │
└─────────────────┬───────────────────────────────────────┘
                  │ Snowflake Wire Protocol
┌─────────────────▼───────────────────────────────────────┐
│  Embucket Instance                                       │
│  ┌─────────────────────────────────────────────────┐   │
│  │ Snowflake API Layer                             │   │
│  └──────────────────┬──────────────────────────────┘   │
│  ┌──────────────────▼──────────────────────────────┐   │
│  │ Query Engine (Apache DataFusion)                │   │
│  └──────────────────┬──────────────────────────────┘   │
│  ┌──────────────────▼──────────────────────────────┐   │
│  │ Catalog Provider                                 │   │
│  └──────────────────┬──────────────────────────────┘   │
└───────────────────┬─┴──────────────────────────────────┘
                    │
┌───────────────────▼─────────────────────────────────────┐
│  Storage Layer (S3, GCS, Azure)                         │
│  └─ Apache Iceberg Tables                               │
└─────────────────────────────────────────────────────────┘

Query-Per-Node Model

Each Embucket instance processes complete queries independently:
  • No coordination overhead: Instances don’t need to communicate with each other
  • Predictable performance: A query's performance depends only on the instance that runs it, not on cluster size
  • Simple scaling: Add more instances behind a load balancer for higher throughput
  • Fault isolation: One instance failure doesn’t affect others
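
Because instances never coordinate, the load balancer in front of them can be very simple. The sketch below models round-robin dispatch in Python: each query goes whole to one healthy instance, and a failed instance is skipped without affecting the rest. The `RoundRobinBalancer` class and the instance names are hypothetical; in practice an off-the-shelf load balancer would fill this role.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Toy dispatcher: one whole query per instance, no cross-instance state."""

    def __init__(self, instances):
        self._instances = list(instances)
        self._healthy = set(self._instances)
        self._ring = cycle(self._instances)

    def mark_down(self, instance):
        self._healthy.discard(instance)

    def mark_up(self, instance):
        if instance in self._instances:
            self._healthy.add(instance)

    def pick(self):
        """Return the next healthy instance in rotation."""
        # One full lap of the ring is enough to find any healthy instance.
        for _ in range(len(self._instances)):
            candidate = next(self._ring)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy instances available")


balancer = RoundRobinBalancer(["node-a", "node-b", "node-c"])
first_lap = [balancer.pick() for _ in range(3)]

# Simulate one instance failing; the others keep serving queries.
balancer.mark_down("node-b")
after_failure = [balancer.pick() for _ in range(4)]
print(first_lap, after_failure)
```

Adding capacity in this model means appending another instance to the list; no existing instance needs to learn about the new one.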

Use Cases

Embucket is ideal for:

Data Lake Analytics

Run SQL analytics on your data lake without complex infrastructure

Snowflake Migration

Migrate workloads from Snowflake to open formats while maintaining compatibility

Edge Analytics

Deploy query engines close to your data with minimal resources

Development & Testing

Local Snowflake-compatible environment for development and CI/CD

Target Audience

Embucket is designed for:
  • Data Engineers: Building and maintaining data pipelines with Snowflake SQL
  • Analytics Engineers: Running dbt projects on open data lake formats
  • Platform Engineers: Deploying simple, scalable query engines
  • Data Teams: Transitioning from proprietary to open data platforms

Getting Started

Ready to try Embucket? Choose your path:

Quickstart

Get Embucket running in under 5 minutes

Installation

Detailed installation instructions for all platforms

License

Embucket is open source software licensed under the Apache License 2.0. See the LICENSE file for details.