Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using DuckDB with s3?

I'm trying to use DuckDB in a jupyter notebook to access and query some parquet files held in s3, but can't seem to get it to work. Judging on past experience, I feel like I need to assign the appropriate file system but I'm not sure how/where to do that.

The below code raises the error: RuntimeError: IO Error: No files found that match the pattern "s3://<bucket>/<file>.parquet"

import boto3
import duckdb

s3 = boto3.resource('s3')
client=boto3.client("s3")
con = duckdb.connect(database=':memory:', read_only=False)

con.execute("""
SET s3_region='-----';
SET s3_access_key_id='-----';
SET s3_secret_access_key='-----';
""")
out = con.execute(f"select * from parquet_scan('s3://<bucket>/<file>.parquet') limit 10;").fetchall()

I'd like to use the pandas read_sql functionality if I can, but put this code to avoid adding complexity to the question.

I'm confused because this code works:

import pandas as pd
import boto3

s3 = boto3.resource('s3')
client=boto3.client("s3")

df = pd.read_parquet("s3://<bucket>/<file>.parquet")
like image 572
Ethan Avatar asked Feb 21 '26 19:02

Ethan


2 Answers

Do this instead

con.execute("""
INSTALL httpfs;
LOAD httpfs;
SET s3_region='-----';
SET s3_access_key_id='-----';
SET s3_secret_access_key='-----';
""")
like image 195
Cesar Ades Avatar answered Feb 24 '26 09:02

Cesar Ades


def query_s3_duckdb(bucket_name, object_name):
    try:
        # Connect to DuckDB and load Parquet file
        conn = duckdb.connect()
        conn.execute("INSTALL httpfs")
        conn.execute("LOAD httpfs")
        conn.execute("SET s3_region='us-east-1'")
        conn.execute("SET s3_access_key_id='xxxxxxxxxxxxxxxxx'")
        conn.execute("SET s3_secret_access_key='xxxxxxxxxxxxxxxxxx'")
        conn.execute(f"CREATE TABLE data AS SELECT * FROM read_parquet('s3://{bucket_name}/{object_name}')")

        # Query data
        result = conn.execute("SELECT COUNT(*) FROM data")
        print(f"Number of rows in data: {result.fetchone()[0]}")

        # Cleanup
        conn.close()
    except S3Error as e:
        print(f"Error querying data from S3: {e}")

This should work!!

like image 22
naveen tata Avatar answered Feb 24 '26 10:02

naveen tata



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!