Overview

Embucket can be deployed as a single, self-contained binary. This deployment method provides:
  • No external runtime dependencies
  • Direct control over execution environment
  • Minimal resource overhead
  • Easy integration with existing infrastructure

Building from Source

Prerequisites

1. Install Rust

Install Rust using rustup:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

2. Install system dependencies

Install the required build dependencies.

Debian/Ubuntu:
apt-get update && apt-get install -y \
  cmake \
  pkg-config \
  libssl-dev \
  ca-certificates
macOS:
brew install cmake pkg-config openssl
RHEL/Fedora:
dnf install cmake pkg-config openssl-devel ca-certificates
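
Before starting a build, it can help to confirm that the tools installed above are actually on PATH. The following is a minimal sketch, not part of Embucket; the function name missing_tools is hypothetical, and it only checks for executables, so it cannot detect missing OpenSSL development headers.

```shell
# Print the name of every listed command that is not found on PATH.
# Prints nothing when all of the commands are available.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}
```

For example, missing_tools cargo cmake pkg-config prints one line per tool that still needs to be installed and prints nothing if the build prerequisites are in place.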

Build Process

git clone https://github.com/Embucket/embucket.git
cd embucket
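cargo build --release
./target/release/embucketd
Release builds are optimized and significantly faster than debug builds. Always use --release for production deployments. For development, a plain cargo build writes an unoptimized binary to ./target/debug/embucketd.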

System Requirements

Hardware Requirements

CPU

  • Minimum: 2 CPU cores
  • Recommended: 4+ cores
  • x86_64 or ARM64 architecture

Memory

  • Minimum: 2 GB RAM
  • Recommended: 4-8 GB RAM
  • Allocate more memory for queries over large datasets

Disk

  • Minimum: 1 GB for binary and metadata
  • Recommended: 10+ GB for query spilling
  • SSD recommended for better I/O

Network

  • Low-latency connection to object storage
  • Bandwidth depends on query workload
  • For S3: deploy in the same region as the bucket

Operating Systems

  • Linux (Ubuntu 20.04+, Debian 11+, RHEL 8+, Fedora 35+)
  • macOS (11.0+)
  • Other Unix-like systems with Rust support
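
The minimum CPU and memory figures above can be turned into a simple preflight check. This is a sketch under stated assumptions: the function name meets_minimum is hypothetical, and the caller supplies the core count and memory in megabytes, since the way to read them differs between operating systems.

```shell
# Succeed only if the host meets the documented minimums:
# 2 CPU cores and 2 GB (2048 MB) of RAM.
meets_minimum() {
  cores=$1
  mem_mb=$2
  [ "$cores" -ge 2 ] && [ "$mem_mb" -ge 2048 ]
}
```

On Linux, the core count is available from getconf _NPROCESSORS_ONLN and total memory from the MemTotal line of /proc/meminfo. The kernel reports slightly less than the installed RAM, so a host with nominally 2 GB may fail this strict comparison.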

Running the Binary

Basic Usage

Start Embucket with default settings:
./embucketd
The server starts on localhost:3000 by default.

With Configuration File

Start with a metastore configuration:
./embucketd --metastore-config config/metastore.yaml

Custom Host and Port

Bind to a specific host and port:
./embucketd --host 0.0.0.0 --port 8080

Using Environment Variables

Configure via environment variables:
export BUCKET_HOST=0.0.0.0
export BUCKET_PORT=3000
export METASTORE_CONFIG=config/metastore.yaml
export QUERY_TIMEOUT_SECS=600
export MAX_CONCURRENCY_LEVEL=16

./embucketd

Binary Flags and Options

Every option below can be passed as a CLI flag or set through its environment variable.
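
The environment variables used on this page suggest a naming pattern: the flag name in upper snake case, with BUCKET_HOST and BUCKET_PORT as the two exceptions for --host and --port. The sketch below encodes that pattern. The function name flag_to_env is hypothetical, and the pattern is an assumption for any flag whose variable is not shown on this page, so confirm names against the binary's built-in help output.

```shell
# Map a CLI flag to the environment variable name suggested by the
# examples on this page.
# ASSUMPTION: flags without an example here follow the same
# upper-snake-case pattern; this is inferred, not confirmed.
flag_to_env() {
  case "$1" in
    --host) echo "BUCKET_HOST" ;;   # shown on this page
    --port) echo "BUCKET_PORT" ;;   # shown on this page
    *) printf '%s\n' "${1#--}" | tr 'a-z-' 'A-Z_' ;;
  esac
}
```

For instance, flag_to_env --query-timeout-secs yields QUERY_TIMEOUT_SECS, which matches the variable exported in the example above.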

Server Configuration

--host
string
default:"localhost"
Host to bind to
--port
number
default:"3000"
Port to bind to
--timeout
number
default:"18000"
Service idle timeout in seconds

Metastore Configuration

--metastore-config
string
Path to YAML config describing volumes/databases to seed the metastore

Query Execution

--max-concurrency-level
number
default:"8"
Maximum number of queries that can run concurrently
--query-timeout-secs
number
default:"1200"
Maximum duration in seconds a single query is allowed to run
--max-concurrent-table-fetches
number
default:"2"
The maximum number of concurrent requests to get table details

Memory and Disk Pools

--mem-pool-type
string
default:"greedy"
Memory pool type for query execution: greedy or fair
--mem-pool-size-mb
number
Maximum memory pool size in megabytes
--mem-enable-track-consumers-pool
boolean
Wrap memory pool with TrackConsumersPool for tracking per-consumer memory usage
--disk-pool-size-mb
number
Maximum disk pool size in megabytes (for spilling)
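
Pool sizes are given in megabytes, so one way to pick a value is as a percentage of the host's total RAM. The helper below is a sketch of that arithmetic only; the name pool_size_mb is hypothetical, and the percentage to use is a tuning decision that this page does not prescribe.

```shell
# Compute a pool size in MB as a percentage of a total given in MB.
# Uses integer arithmetic, so the result is rounded down.
pool_size_mb() {
  total_mb=$1
  percent=$2
  echo $(( total_mb * percent / 100 ))
}
```

As an illustration, pool_size_mb 16384 50 gives 8192, so on a 16 GB host a 50% budget would be passed as --mem-pool-size-mb 8192.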

Authentication

--auth-demo-user
string
default:"embucket"
Username for demo authentication
--auth-demo-password
string
default:"embucket"
Password for demo authentication
--jwt-secret
string
JWT secret for auth (hidden in logs)

SQL Configuration

--data-format
string
default:"json"
Data serialization format used by the Snowflake v1 API
--sql-parser-dialect
string
default:"snowflake"
SQL parser dialect: snowflake, postgres, mysql, generic, etc.

Tracing and Monitoring

--tracing-level
string
default:"info"
Tracing level: off, info, debug, or trace (overridden by RUST_LOG)
--tracing-span-processor
string
default:"batch-span-processor"
Tracing span processor
--otel-exporter-otlp-protocol
string
default:"grpc"
OpenTelemetry Exporter Protocol: grpc or http/json
--alloc-tracing
boolean
Enable memory tracing functionality

AWS SDK Timeouts

--aws-sdk-connect-timeout-secs
number
default:"3"
AWS SDK connect timeout in seconds
--aws-sdk-operation-timeout-secs
number
default:"30"
AWS SDK operation timeout in seconds
--aws-sdk-operation-attempt-timeout-secs
number
default:"10"
AWS SDK operation attempt timeout in seconds
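
When overriding these three values, it is easy to end up with a combination that cannot work, such as a per-attempt timeout longer than the whole operation. The sketch below checks the ordering connect <= attempt <= operation. It assumes the flags carry the usual AWS SDK meaning, where the operation timeout covers all retry attempts and the attempt timeout covers a single one; the function name timeouts_consistent is hypothetical.

```shell
# Succeed only if connect <= attempt <= operation (all in seconds).
# ASSUMPTION: the operation timeout spans every retry attempt, as in
# the AWS SDK's own timeout configuration.
timeouts_consistent() {
  connect=$1
  attempt=$2
  operation=$3
  [ "$connect" -le "$attempt" ] && [ "$attempt" -le "$operation" ]
}
```

The defaults listed above (3, 10, and 30 seconds) pass this check, and under the same assumption they leave room for up to three full-length attempts within one operation.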

Iceberg Timeouts

--iceberg-table-timeout-secs
number
default:"30"
Iceberg create table timeout in seconds
--iceberg-catalog-timeout-secs
number
default:"10"
Iceberg catalog timeout in seconds

Object Store Timeouts

--object-store-timeout-secs
number
default:"10"
Object store timeout in seconds
--object-store-connect-timeout-secs
number
default:"3"
Object store connect timeout in seconds

Systemd Service

Run Embucket as a systemd service for automatic startup and management.

Service File

Create /etc/systemd/system/embucket.service:
[Unit]
Description=Embucket Lakehouse Server
After=network.target

[Service]
Type=simple
User=embucket
Group=embucket
WorkingDirectory=/opt/embucket
EnvironmentFile=/opt/embucket/embucket.env
ExecStart=/opt/embucket/embucketd --metastore-config /opt/embucket/config/metastore.yaml
Restart=on-failure
RestartSec=10
StandardOutput=journal
StandardError=journal
SyslogIdentifier=embucket

# Security hardening
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/opt/embucket/data

# Resource limits
LimitNOFILE=65536
LimitNPROC=4096

[Install]
WantedBy=multi-user.target

Environment File

Create /opt/embucket/embucket.env:
BUCKET_HOST=0.0.0.0
BUCKET_PORT=3000
QUERY_TIMEOUT_SECS=1200
MAX_CONCURRENCY_LEVEL=16
MEM_POOL_TYPE=greedy
MEM_POOL_SIZE_MB=8192
DISK_POOL_SIZE_MB=20480
TRACING_LEVEL=info
JWT_SECRET=your-secure-jwt-secret-here
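
systemd environment files accept only blank lines, comments, and KEY=VALUE assignments; shell syntax such as export is not supported. The following sketch checks a file for that shape and for the placeholder secret shown above. The function name check_env_file is hypothetical, and the check is deliberately loose: it does not validate quoting or the values themselves.

```shell
# Report problems in a systemd-style environment file:
#   - lines that are neither blank, comments (# or ;), nor KEY=VALUE
#   - a JWT_SECRET still set to the placeholder from this page
# Returns 0 when no problems are found, 1 otherwise.
check_env_file() {
  file=$1
  status=0
  if grep -Ev '^[[:space:]]*([#;].*)?$|^[A-Za-z_][A-Za-z0-9_]*=' "$file" >/dev/null; then
    echo "malformed line(s) in $file" >&2
    status=1
  fi
  if grep -q '^JWT_SECRET=your-secure-jwt-secret-here$' "$file"; then
    echo "JWT_SECRET is still the placeholder" >&2
    status=1
  fi
  return "$status"
}
```

Running check_env_file /opt/embucket/embucket.env before enabling the service catches both problems early, before they surface as a unit that fails to start or a server running with a known secret.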

Installation

# Create user and directories
sudo useradd -r -s /bin/false embucket
sudo mkdir -p /opt/embucket/{config,data}
sudo chown -R embucket:embucket /opt/embucket

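# Copy binary
sudo cp target/release/embucketd /opt/embucket/
sudo chmod +x /opt/embucket/embucketd

# Install the files the unit references (assumes both exist in the current directory tree)
# The environment file holds JWT_SECRET, so restrict it to root
sudo install -m 600 embucket.env /opt/embucket/embucket.env
sudo cp config/metastore.yaml /opt/embucket/config/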

# Install service
sudo cp embucket.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable embucket
sudo systemctl start embucket
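
After starting the service, a deployment script usually needs to wait until the unit is actually up before continuing. A small retry helper is one way to do that; this is a generic sketch, and the name retry and its argument order are hypothetical.

```shell
# Run a command until it succeeds, up to a maximum number of attempts,
# sleeping between attempts.
# Usage: retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
retry() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  return 1
}
```

For example, retry 10 1 systemctl is-active --quiet embucket polls the unit state once per second for up to ten attempts and fails if the service never becomes active.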

Service Management

# Start service
sudo systemctl start embucket

# Stop service
sudo systemctl stop embucket

# Restart service
sudo systemctl restart embucket

# Check status
sudo systemctl status embucket

# View logs
sudo journalctl -u embucket -f

Production Considerations

Reverse Proxy

Use nginx or similar for:
  • TLS termination
  • Load balancing
  • Request rate limiting
  • Access logging

Monitoring

Integrate with monitoring systems:
  • Prometheus metrics (via OTLP)
  • CloudWatch (if on AWS)
  • Datadog, New Relic, etc.

High Availability

Deploy multiple instances:
  • Place them behind a load balancer
  • Point all instances at the same object storage backend
  • Each query runs on a single node (query-per-node architecture)

Security

Security best practices:
  • Change the default demo credentials (embucket/embucket)
  • Use a strong, randomly generated JWT secret
  • Run as a non-root user
  • Restrict access to the server port with firewall rules
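
A strong JWT secret should come from a cryptographic random source rather than be typed by hand. The sketch below reads 32 random bytes from /dev/urandom and prints them as 64 hexadecimal characters. The function name gen_secret is hypothetical, and the 32-byte length is an assumption, not an Embucket requirement.

```shell
# Print a random 256-bit secret as 64 lowercase hex characters.
gen_secret() {
  head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
  echo
}
```

The output can be stored as the JWT_SECRET value in the environment file or passed with --jwt-secret. Keeping it in the environment file avoids exposing it in the process list.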

Next Steps

Configuration

Explore all configuration options

Docker Deployment

Deploy using Docker containers