

Overview

Embucket can be configured using:
  • Command-line flags: ./embucketd --flag value
  • Environment variables: export VAR=value
  • Configuration files: YAML files for metastore configuration
All CLI flags have corresponding environment variable equivalents.
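As a sketch of that correspondence (confirm the exact names with ./embucketd --help, since some flags use a prefixed name, e.g. BUCKET_HOST for --host in the examples below), a flag generally maps to its environment variable by stripping the leading dashes, uppercasing, and replacing hyphens with underscores:

```shell
# Derive an environment-variable name from a CLI flag:
# strip leading dashes, uppercase, hyphens -> underscores.
flag_to_env() {
  echo "$1" | sed 's/^--//' | tr 'a-z-' 'A-Z_'
}

flag_to_env --query-timeout-secs   # prints QUERY_TIMEOUT_SECS
flag_to_env --mem-pool-size-mb     # prints MEM_POOL_SIZE_MB
```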

Server Configuration

Network Settings

--host
string
default:"localhost"
Host address to bind the server to.
  • Use 0.0.0.0 to accept connections from any network interface
  • Use localhost or 127.0.0.1 for local-only access
  • Specify a specific IP address to bind to a particular interface
Example:
./embucketd --host 0.0.0.0
--port
number
default:"3000"
Port number for the Snowflake-compatible API server.
Example:
./embucketd --port 8080
--timeout
number
default:"18000"
Service idle timeout in seconds. Connections idle for longer than this duration may be closed.
Default: 18000 seconds (5 hours)
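Example (an illustrative value shortening the idle timeout to one hour):
./embucketd --timeout 3600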

Metastore Configuration

--metastore-config
string
Path to YAML configuration file describing volumes, databases, schemas, and tables.
Example:
./embucketd --metastore-config /opt/embucket/config/metastore.yaml
See Metastore Configuration File section below for YAML format.

Metastore Configuration File

The metastore configuration file defines external catalogs and table locations.

S3 Tables (AWS S3 Table Buckets)

volumes:
  - ident: demo
    type: s3-tables
    database: demo
    credentials:
      credential_type: access_key
      aws-access-key-id: YOUR_ACCESS_KEY
      aws-secret-access-key: YOUR_SECRET_KEY
    arn: arn:aws:s3tables:us-east-2:123456789012:bucket/my-table-bucket

S3 with External Iceberg Tables

volumes:
  - ident: lakehouse
    type: s3
    region: us-east-2
    bucket: YOUR_BUCKET_NAME
    credentials:
      credential_type: access_key
      aws-access-key-id: YOUR_ACCESS_KEY
      aws-secret-access-key: YOUR_SECRET_KEY

databases:
  - ident: demo
    volume: lakehouse

schemas:
  - database: demo
    schema: tpch_10

tables:
  - database: demo
    schema: tpch_10
    table: customer
    metadata_location: s3://YOUR_BUCKET_NAME/tpch_10/customer/metadata/00001-eea1cccb-38a4-4fe2-8c95-c01dae9d0c60.metadata.json
  - database: demo
    schema: tpch_10
    table: lineitem
    metadata_location: s3://YOUR_BUCKET_NAME/tpch_10/lineitem/metadata/00001-d777220e-d508-4033-a229-8c4c8d8fe514.metadata.json
External Iceberg tables must reside in the bucket named by the volume definition.

Query Execution Settings

Concurrency and Timeouts

--max-concurrency-level
number
default:"8"
Maximum number of queries that can run concurrently.
  • Higher values increase throughput but require more memory
  • Set based on available CPU cores and memory
Example:
./embucketd --max-concurrency-level 16
--query-timeout-secs
number
default:"1200"
Maximum duration in seconds a single query is allowed to run before being terminated.
Default: 1200 seconds (20 minutes)
Example:
./embucketd --query-timeout-secs 600
--max-concurrent-table-fetches
number
default:"2"
Maximum number of concurrent requests to fetch table metadata.
  • Increase for faster catalog scanning with many tables
  • May increase load on catalog service
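Example (an illustrative value for catalogs with many tables):
./embucketd --max-concurrent-table-fetches 4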

Memory and Disk Pools

Memory Pool Configuration

--mem-pool-type
string
default:"greedy"
Memory pool allocation strategy:
  • greedy: Allocates memory aggressively, may use all available pool for a single query
  • fair: Distributes memory more evenly across concurrent queries
Example:
./embucketd --mem-pool-type fair
--mem-pool-size-mb
number
Maximum memory pool size in megabytes for query execution.
  • If not set, uses system-determined defaults
  • Recommended: 50-70% of available system memory
Example:
./embucketd --mem-pool-size-mb 8192
--mem-enable-track-consumers-pool
boolean
Enable tracking of per-consumer (per-query) memory usage.
  • Useful for debugging memory-intensive queries
  • Adds slight overhead
Example:
./embucketd --mem-enable-track-consumers-pool true

Disk Pool Configuration

--disk-pool-size-mb
number
Maximum disk pool size in megabytes for query spilling.
  • Used when queries exceed memory pool size
  • Requires fast disk (SSD recommended)
Example:
./embucketd --disk-pool-size-mb 20480

SQL Configuration

--data-format
string
default:"json"
Data serialization format for the Snowflake v1 API.
Default: json
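Example (explicitly selecting the default format):
./embucketd --data-format json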
--sql-parser-dialect
string
default:"snowflake"
SQL parser dialect to use for query parsing.
Options: snowflake, postgres, mysql, generic
Example:
./embucketd --sql-parser-dialect snowflake

Authentication

--auth-demo-user
string
default:"embucket"
Username for demo authentication mode.
Example:
./embucketd --auth-demo-user myuser
--auth-demo-password
string
default:"embucket"
Password for demo authentication mode.
Example:
./embucketd --auth-demo-password mypassword
--jwt-secret
string
Secret key for JWT token signing. Values are hidden in logs for security.
Always change this from the default in production deployments. Use a cryptographically secure random string.
Example:
export JWT_SECRET="your-secure-random-string-here"
./embucketd

Tracing and Monitoring

Logging Configuration

--tracing-level
string
default:"info"
Default tracing/logging level.
Options:
  • off: Disable all logging
  • info: Standard operational logs
  • debug: Detailed debugging information
  • trace: Very verbose trace-level logs
This setting is overridden by the RUST_LOG environment variable if set.
Example:
./embucketd --tracing-level debug
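Because RUST_LOG takes precedence when set, the same effect can be achieved through the environment (a sketch using the standard Rust log-filter syntax):
export RUST_LOG=debug
./embucketd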
--alloc-tracing
boolean
Enable memory allocation tracing for debugging memory usage.
  • Adds performance overhead
  • Useful for identifying memory leaks or inefficient allocations
Example:
./embucketd --alloc-tracing true

OpenTelemetry Configuration

--otel-exporter-otlp-protocol
string
default:"grpc"
OpenTelemetry OTLP exporter protocol.
Options:
  • grpc: Use gRPC protocol (default)
  • http/json: Use HTTP with JSON encoding
Example:
./embucketd --otel-exporter-otlp-protocol grpc
--tracing-span-processor
string
default:"batch-span-processor"
Tracing span processor type.
Options:
  • batch-span-processor: Batch spans for efficient export
  • batch-span-processor-experimental-async-runtime: Experimental async batch processor
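Example (explicitly selecting the default processor):
./embucketd --tracing-span-processor batch-span-processor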

OTLP Endpoint

Configure the OpenTelemetry collector endpoint using standard OTLP environment variables:
export OTEL_EXPORTER_OTLP_ENDPOINT=https://your-collector:4317
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc
./embucketd

AWS SDK Timeouts

These settings control timeouts for AWS SDK operations (S3, S3 Tables, etc.).
--aws-sdk-connect-timeout-secs
number
default:"3"
Timeout for establishing AWS SDK connections in seconds.
--aws-sdk-operation-timeout-secs
number
default:"30"
Total timeout for AWS SDK operations in seconds.
--aws-sdk-operation-attempt-timeout-secs
number
default:"10"
Timeout for individual AWS SDK operation attempts in seconds (before retry).
Example:
./embucketd \
  --aws-sdk-connect-timeout-secs 5 \
  --aws-sdk-operation-timeout-secs 60 \
  --aws-sdk-operation-attempt-timeout-secs 15

Iceberg Timeouts

--iceberg-table-timeout-secs
number
default:"30"
Timeout for Iceberg table operations (create, load metadata) in seconds.
--iceberg-catalog-timeout-secs
number
default:"10"
Timeout for Iceberg catalog operations in seconds.
Example:
./embucketd \
  --iceberg-table-timeout-secs 60 \
  --iceberg-catalog-timeout-secs 20

Object Store Timeouts

--object-store-timeout-secs
number
default:"10"
Timeout for object store operations (reads, writes) in seconds.
--object-store-connect-timeout-secs
number
default:"3"
Timeout for establishing object store connections in seconds.
Example:
./embucketd \
  --object-store-timeout-secs 30 \
  --object-store-connect-timeout-secs 5

Configuration Examples

Development Configuration

Quick local development setup:
./embucketd \
  --host localhost \
  --port 3000 \
  --tracing-level debug \
  --max-concurrency-level 4 \
  --query-timeout-secs 300

Production Configuration

Optimized production deployment:
./embucketd \
  --host 0.0.0.0 \
  --port 3000 \
  --metastore-config /opt/embucket/config/metastore.yaml \
  --max-concurrency-level 16 \
  --query-timeout-secs 1800 \
  --mem-pool-type greedy \
  --mem-pool-size-mb 16384 \
  --disk-pool-size-mb 51200 \
  --tracing-level info \
  --aws-sdk-operation-timeout-secs 60 \
  --object-store-timeout-secs 30

Environment Variables Configuration

Using environment variables for cleaner configuration:
# Create environment file
cat > embucket.env << EOF
BUCKET_HOST=0.0.0.0
BUCKET_PORT=3000
METASTORE_CONFIG=/opt/embucket/config/metastore.yaml
QUERY_TIMEOUT_SECS=1800
MAX_CONCURRENCY_LEVEL=16
MEM_POOL_TYPE=greedy
MEM_POOL_SIZE_MB=16384
DISK_POOL_SIZE_MB=51200
TRACING_LEVEL=info
JWT_SECRET=your-secure-secret-here
AWS_SDK_OPERATION_TIMEOUT_SECS=60
OBJECT_STORE_TIMEOUT_SECS=30
EOF

# Load and run
source embucket.env
./embucketd

High-Performance Configuration

For compute-intensive workloads:
./embucketd \
  --max-concurrency-level 32 \
  --mem-pool-type fair \
  --mem-pool-size-mb 32768 \
  --disk-pool-size-mb 102400 \
  --max-concurrent-table-fetches 8 \
  --query-timeout-secs 3600

Docker Environment Variables

When running in Docker, these additional environment variables are commonly used:
docker run -p 3000:3000 \
  -e BUCKET_HOST=0.0.0.0 \
  -e OBJECT_STORE_BACKEND=s3 \
  -e AWS_ACCESS_KEY_ID=your-key \
  -e AWS_SECRET_ACCESS_KEY=your-secret \
  -e AWS_REGION=us-east-2 \
  -e S3_BUCKET=your-bucket \
  -e S3_ENDPOINT=http://minio:9000 \
  -e S3_ALLOW_HTTP=true \
  -e QUERY_TIMEOUT_SECS=1200 \
  -e MAX_CONCURRENCY_LEVEL=8 \
  embucket/embucket

Next Steps

  • Docker: Deploy with Docker
  • AWS Lambda: Serverless deployment
  • Binary: Standalone binary setup