DuckDB


DuckDB is a high-performance SQL database system optimized for analytical workloads. It can directly query Iceberg tables stored in your analytics buckets, making it ideal for data exploration and complex analytical queries.

Installation

Install DuckDB. The Iceberg extension is installed from within DuckDB itself, so no extra Python package is needed:

```bash
pip install duckdb
```

Connecting to analytics buckets

Here's a complete example of connecting to your Supabase analytics bucket and querying Iceberg tables:

```python
import duckdb

# Configuration
PROJECT_REF = "your-project-ref"
WAREHOUSE = "your-analytics-bucket-name"
SERVICE_KEY = "your-service-key"

# S3 credentials
S3_ACCESS_KEY = "your-access-key"
S3_SECRET_KEY = "your-secret-key"
S3_REGION = "us-east-1"

# Construct endpoints (the S3 endpoint is specified without a scheme)
S3_ENDPOINT = f"{PROJECT_REF}.supabase.co/storage/v1/s3"
CATALOG_URI = f"https://{PROJECT_REF}.supabase.co/storage/v1/iceberg"

# Initialize DuckDB connection
conn = duckdb.connect(":memory:")

# Install and load the Iceberg extension
conn.install_extension("iceberg")
conn.load_extension("iceberg")

# S3 credentials for reading the underlying data files
conn.execute(f"""
    CREATE SECRET (
        TYPE S3,
        KEY_ID '{S3_ACCESS_KEY}',
        SECRET '{S3_SECRET_KEY}',
        REGION '{S3_REGION}',
        ENDPOINT '{S3_ENDPOINT}',
        URL_STYLE 'path'
    );
""")

# Token for authenticating against the Iceberg REST catalog
conn.execute(f"""
    CREATE SECRET iceberg_secret (
        TYPE ICEBERG,
        TOKEN '{SERVICE_KEY}'
    );
""")

# Attach the REST catalog
conn.execute(f"""
    ATTACH '{WAREHOUSE}' AS iceberg_catalog (
        TYPE iceberg,
        SECRET iceberg_secret,
        ENDPOINT '{CATALOG_URI}'
    );
""")

# Query your Iceberg tables
result = conn.execute("""
    SELECT *
    FROM iceberg_catalog.default.events
    LIMIT 10
""").fetchall()

for row in result:
    print(row)

# Complex aggregation example
analytics = conn.execute("""
    SELECT
        event_name,
        COUNT(*) AS event_count,
        COUNT(DISTINCT user_id) AS unique_users
    FROM iceberg_catalog.default.events
    GROUP BY event_name
    ORDER BY event_count DESC
""").fetchdf()

print(analytics)
```

Key features with DuckDB

Efficient data exploration

DuckDB pushes column selections and filters down into the Iceberg scan, so only the data your query needs is read:

```python
# Only the selected columns are read from storage
events = conn.execute("""
    SELECT event_id, event_name, event_timestamp
    FROM iceberg_catalog.default.events
    WHERE event_timestamp > now() - INTERVAL '7 days'
""").fetchdf()
```

Converting to Pandas

Convert results to Pandas DataFrames for further analysis:

```python
df = conn.execute("""
    SELECT *
    FROM iceberg_catalog.default.events
""").fetchdf()

# Use pandas for visualization or further processing
print(df.describe())
```

Exporting results

Save your analytical results to various formats:

```python
# Export to Parquet
conn.execute("""
    COPY (
        SELECT * FROM iceberg_catalog.default.events
    ) TO 'results.parquet' (FORMAT PARQUET)
""")

# Export to CSV
conn.execute("""
    COPY (
        SELECT event_name, COUNT(*) AS count
        FROM iceberg_catalog.default.events
        GROUP BY event_name
    ) TO 'summary.csv' (FORMAT CSV, HEADER true)
""")
```

Best practices

  • Connection pooling - Reuse connections for multiple queries
  • Partition pruning - Filter by partition columns to improve query performance
  • Column selection - Only select columns you need to reduce I/O
  • Limit results - Use LIMIT during exploration to avoid processing large datasets

Next steps