DuckDB


DuckDB is a high-performance SQL database system optimized for analytical workloads. It can directly query Iceberg tables stored in your analytics buckets, making it ideal for data exploration and complex analytical queries.

Installation

Install the DuckDB Python package. The Iceberg extension itself is installed from within DuckDB at runtime (via install_extension, as shown below), so no separate package is needed:

pip install duckdb

Connecting to analytics buckets

Here's a complete example of connecting to your Supabase analytics bucket and querying Iceberg tables:

import duckdb

# Configuration
PROJECT_REF = "your-project-ref"
WAREHOUSE = "your-analytics-bucket-name"
SERVICE_KEY = "your-service-key"

# S3 credentials
S3_ACCESS_KEY = "your-access-key"
S3_SECRET_KEY = "your-secret-key"
S3_REGION = "us-east-1"

# Construct endpoints
S3_ENDPOINT = f"https://{PROJECT_REF}.supabase.co/storage/v1/s3"
CATALOG_URI = f"https://{PROJECT_REF}.supabase.co/storage/v1/iceberg"

# Initialize DuckDB connection
conn = duckdb.connect(":memory:")

# Install and load the Iceberg extension
conn.install_extension("iceberg")
conn.load_extension("iceberg")

# S3 credentials for reading the underlying data files
conn.execute(f"""
CREATE SECRET (
    TYPE S3,
    KEY_ID '{S3_ACCESS_KEY}',
    SECRET '{S3_SECRET_KEY}',
    REGION '{S3_REGION}',
    ENDPOINT '{S3_ENDPOINT}',
    URL_STYLE 'path'
);
""")

# Bearer token for the Iceberg REST catalog
conn.execute(f"""
CREATE SECRET iceberg_secret (
    TYPE ICEBERG,
    TOKEN '{SERVICE_KEY}'
);
""")

# Attach the analytics bucket as an Iceberg REST catalog
conn.execute(f"""
ATTACH '{WAREHOUSE}' AS iceberg_catalog (
    TYPE ICEBERG,
    SECRET iceberg_secret,
    ENDPOINT '{CATALOG_URI}'
);
""")

# Query your Iceberg tables
result = conn.execute("""
    SELECT *
    FROM iceberg_catalog.default.events
    LIMIT 10
""").fetchall()

for row in result:
    print(row)

# Complex aggregation example
analytics = conn.execute("""
    SELECT
        event_name,
        COUNT(*) AS event_count,
        COUNT(DISTINCT user_id) AS unique_users
    FROM iceberg_catalog.default.events
    GROUP BY event_name
    ORDER BY event_count DESC
""").fetchdf()

print(analytics)

Key features with DuckDB

Efficient data exploration

DuckDB pushes column projections and row filters down into the scan, so it reads only the data your query actually needs:

# This only reads the columns you select
events = conn.execute("""
    SELECT event_id, event_name, event_timestamp
    FROM iceberg_catalog.default.events
    WHERE event_timestamp > NOW() - INTERVAL '7 days'
""").fetchdf()

Converting to Pandas

Convert results to Pandas DataFrames for further analysis:

df = conn.execute("""
    SELECT *
    FROM iceberg_catalog.default.events
""").fetchdf()

# Use pandas for visualization or further processing
print(df.describe())

Exporting results

Save your analytical results to various formats:

# Export to Parquet (format inferred from the file extension)
conn.execute("""
COPY (
    SELECT * FROM iceberg_catalog.default.events
) TO 'results.parquet'
""")

# Export to CSV
conn.execute("""
COPY (
    SELECT event_name, COUNT(*) AS count
    FROM iceberg_catalog.default.events
    GROUP BY event_name
) TO 'summary.csv' (FORMAT CSV, HEADER true)
""")

Best practices

  • Connection pooling - Reuse connections for multiple queries
  • Partition pruning - Filter by partition columns to improve query performance
  • Column selection - Only select columns you need to reduce I/O
  • Limit results - Use LIMIT during exploration to avoid processing large datasets

Next steps