Embucket uses Apache Iceberg as its table format, providing ACID transactions, schema evolution, and time travel capabilities on your data lake. This page explains how Iceberg integrates with Embucket’s architecture.
Why Apache Iceberg?
Apache Iceberg solves critical challenges in data lake architectures:
- ACID Transactions: Full transactional guarantees for concurrent reads and writes
- Schema Evolution: Add, remove, or rename columns without rewriting data
- Time Travel: Query historical table snapshots at any point in time
- Hidden Partitioning: Automatic partition pruning without partition filters in queries
- Performance: Column-level statistics and file-level metrics for query optimization
- Vendor Neutral: Open format works with any compute engine (Spark, Trino, Flink, etc.)
Iceberg Table Format
An Iceberg table consists of three layers of metadata:

Metadata File
The metadata file (*.metadata.json) is the root pointer for an Iceberg table. It contains:
- Schema: Column names, types, and IDs with evolution history
- Partition spec: Partitioning strategy (if any)
- Sort order: Physical data layout preferences
- Snapshots: List of table versions with timestamp and manifest list pointers
- Current snapshot: Active table version
- Properties: Table configuration (format version, write settings, etc.)
See crates/catalog/src/table.rs for the table loading implementation.
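Abridged, a metadata file following the Iceberg table spec looks roughly like this (the bucket paths, UUIDs, and column names below are illustrative, not from Embucket):

```json
{
  "format-version": 2,
  "table-uuid": "9c12d441-03fe-4693-9a96-a0705ddf69c1",
  "location": "s3://bucket/warehouse/analytics/events",
  "current-schema-id": 0,
  "schemas": [
    {
      "schema-id": 0,
      "type": "struct",
      "fields": [
        {"id": 1, "name": "event_id", "required": true, "type": "long"},
        {"id": 2, "name": "event_ts", "required": false, "type": "timestamptz"}
      ]
    }
  ],
  "partition-specs": [
    {
      "spec-id": 0,
      "fields": [
        {"name": "event_ts_day", "transform": "day", "source-id": 2, "field-id": 1000}
      ]
    }
  ],
  "current-snapshot-id": 3051729675574597004,
  "snapshots": [
    {
      "snapshot-id": 3051729675574597004,
      "timestamp-ms": 1715000000000,
      "manifest-list": "s3://bucket/warehouse/analytics/events/metadata/snap-3051729675574597004.avro"
    }
  ],
  "properties": {"write.format.default": "parquet"}
}
```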
Manifest List
Each snapshot points to a manifest list (an Avro file) that tracks all data files in that version:
- Manifest files: List of manifest file locations
- Partition summaries: Min/max bounds for each partition
- Added/deleted files: Change tracking from previous snapshot
Manifest Files
Manifest files (also Avro) contain detailed information about data files:
- Data file path: Location of the Parquet file in object storage
- Partition values: Values for partitioned columns
- Record count: Number of rows in the file
- File size: Byte size for I/O estimation
- Column stats: Min/max values, null counts, distinct counts
- Status: Added, deleted, or existing
Data Files
Data files (typically Parquet) contain the actual table data:
- Columnar format: Efficient compression and encoding per column
- Schema embedded: Self-describing with column metadata
- Statistics: File-level and row-group-level min/max values
Embucket reads and writes these files in object storage through the object_store library.
ACID Guarantees
Iceberg provides full ACID (Atomicity, Consistency, Isolation, Durability) guarantees:

Atomicity
All-or-nothing commits: Changes to a table are committed atomically by writing a new metadata file. If the commit fails (e.g., a network error), the old metadata remains unchanged and no partial state is visible. See crates/executor/src/query.rs for commit handling during write operations.
Consistency
Snapshot isolation: Each snapshot represents a consistent view of the table at a specific point in time. Queries always see a single consistent snapshot, even while concurrent writes are occurring.

Schema consistency: Schema changes are versioned and tracked. Old snapshots continue to use their original schema, enabling time travel queries.

Isolation
Optimistic concurrency control: Iceberg uses optimistic locking for concurrent writes. Multiple writers can commit changes to the same table, with the catalog ensuring only one commit succeeds for conflicting changes. Conflict resolution:
- Non-conflicting writes: Different partitions or files can be committed concurrently
- Conflicting writes: Retry logic attempts to merge compatible changes
- Failed commits: Transaction fails if changes cannot be safely merged
Durability
Durable commits: Once a metadata file is successfully written to object storage, the changes are durable. Object storage (S3, etc.) provides durability guarantees.

No external coordinator: Iceberg doesn’t require a separate locking service or coordinator; the atomic metadata file update provides coordination.

Catalog Integration
Embucket integrates with Iceberg catalogs to discover and manage tables.

Catalog Types
- Embucket Native
- AWS S3 Table Buckets
- REST Catalog (Experimental)
Configuration-based catalog for explicit table definitions. Define tables in metastore.yaml; Embucket loads these tables at startup and tracks metadata location pointers in memory.

Catalog Operations

Embucket supports standard catalog operations via SQL, such as listing databases, schemas, and tables.

Table Operations

Embucket provides full read and write support for Iceberg tables.

Reading Tables
Query execution:
- Load table metadata from catalog
- Identify current snapshot
- Read manifest list for snapshot
- Filter manifest files by partition bounds
- Read relevant manifest files
- Filter data files by partition and column stats
- Generate physical plan to read data files
- Execute plan with partition and column pruning
Embucket uses the DataFusion Iceberg integration (datafusion-iceberg) for efficient table scans with predicate pushdown.
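For example, a filtered scan like the following (table and column names are illustrative) lets the steps above prune manifests by partition bounds and skip data files by column statistics:

```sql
SELECT event_id, event_type, event_ts
FROM analytics.events
WHERE event_ts >= '2024-01-01'      -- prunes partitions via the partition transform on event_ts
  AND event_type = 'purchase';      -- skips data files using per-file min/max column statistics
```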
Writing Tables
INSERT operations:
- Execute query plan to produce result data
- Write new Parquet data files to object storage
- Create new manifest file(s) listing new data files
- Create new manifest list including all manifests
- Write new metadata file with new snapshot
- Atomically commit by updating metastore pointer
See crates/executor/src/datafusion/logical_plan/merge.rs for the MERGE implementation.
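As a sketch, an INSERT and a MERGE that would trigger the commit steps above (Snowflake-style syntax; the table and column names are illustrative and exact Embucket support may differ):

```sql
-- Appends new Parquet files and commits a new snapshot
INSERT INTO analytics.events (event_id, event_type, event_ts)
VALUES (1, 'signup', '2024-01-15 10:00:00');

-- Upsert from a staging table
MERGE INTO analytics.events AS t
USING staging.events AS s
  ON t.event_id = s.event_id
WHEN MATCHED THEN
  UPDATE SET t.event_type = s.event_type
WHEN NOT MATCHED THEN
  INSERT (event_id, event_type, event_ts)
  VALUES (s.event_id, s.event_type, s.event_ts);
```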
Schema Evolution
Iceberg supports schema changes without rewriting data, such as adding, renaming, or dropping columns via ALTER TABLE.

Time Travel

Query historical table versions using Iceberg snapshots, addressed by snapshot ID or timestamp. Time travel syntax support varies; check your Embucket version for current implementation status.
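Since syntax support varies, the following is only a sketch of what snapshot-based queries look like in Snowflake-style dialects (the snapshot ID and table name are illustrative):

```sql
-- Query the table as of a specific snapshot
SELECT COUNT(*)
FROM analytics.events AT (SNAPSHOT => 3051729675574597004);

-- Query the table as of a point in time
SELECT COUNT(*)
FROM analytics.events AT (TIMESTAMP => '2024-01-01 00:00:00'::timestamp);
```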
Performance Optimization
Embucket leverages Iceberg features for query performance:Partition Pruning
Hidden partitioning: Iceberg supports partition transforms (e.g., day(event_ts)) so that queries filtering on the source column are pruned automatically, without users writing explicit partition filters.

Column Statistics

Min/max filtering: Iceberg tracks column min/max values per data file, allowing the planner to skip files whose value ranges cannot match a query predicate.

File-Level Metadata
Record count estimation: File-level record counts enable accurate cardinality estimation for query planning.

Size-based planning: Embucket uses file sizes to balance work across parallel scan tasks.

Compaction

Optimize table layout for better performance:
- Fewer files to list and open
- Better compression ratios
- Improved column statistics
- Reduced metadata overhead
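The troubleshooting section below suggests running OPTIMIZE to compact small files; a compaction run might look like this (illustrative syntax, exact support depends on your Embucket version):

```sql
-- Rewrite many small data files into fewer, larger ones
OPTIMIZE TABLE analytics.events;
```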
Storage Benefits
Apache Iceberg provides several storage advantages:

No Vendor Lock-In
Open format: Your data remains accessible to any Iceberg-compatible tool:
- Apache Spark
- Apache Flink
- Trino / Presto
- AWS Athena
- Snowflake (Iceberg tables)
- Dremio
- StarRocks
Cost Efficiency
Object storage: Data stored in cost-effective object storage (S3, GCS, Azure Blob) rather than proprietary formats.

Pay for what you store: No hidden storage markup; standard object storage pricing applies.

Compression: Parquet format with efficient compression reduces storage costs.

Data Governance
Audit trail: Snapshot history provides a complete audit trail of all table changes.

Rollback: Revert to previous table versions if needed.

Best Practices

Choose appropriate partitioning: Partition on columns commonly used in WHERE clauses (e.g., date, region), and keep partition cardinality moderate to avoid producing many small files.
Troubleshooting
Common Issues
“Table metadata not found” errors:
- Verify the metadata location path is correct and accessible
- Check object storage credentials and permissions
- Ensure the metadata file exists at the specified location

Slow queries:
- Review the partition strategy: are there too many small files?
- Check whether partition pruning is working (examine the query plan)
- Consider running OPTIMIZE to compact small files
- Verify column statistics are available

Commit conflicts:
- Multiple writers may be attempting conflicting changes
- Implement retry logic in the application
- Consider a partitioning strategy that reduces write conflicts

Growing metadata:
- Expire old snapshots that are no longer needed
- Consider metadata compaction
- Review snapshot retention policies
Next Steps
Architecture
Learn about Embucket’s overall architecture
Snowflake Compatibility
Understand Snowflake feature support