I'm trying to use DuckDB in a Jupyter notebook to query some Parquet files held in S3, but I can't get it to work. Judging from past experience, I think I need to register the appropriate file system, but I'm not sure how or where to do that.
The code below raises the error: RuntimeError: IO Error: No files found that match the pattern "s3://<bucket>/<file>.parquet"
import boto3
import duckdb
s3 = boto3.resource('s3')
client=boto3.client("s3")
con = duckdb.connect(database=':memory:', read_only=False)
con.execute("""
SET s3_region='-----';
SET s3_access_key_id='-----';
SET s3_secret_access_key='-----';
""")
out = con.execute("select * from parquet_scan('s3://<bucket>/<file>.parquet') limit 10;").fetchall()
I'd like to use the pandas read_sql functionality if I can, but I kept the code above minimal to avoid adding complexity to the question.
I'm confused because this code works:
import pandas as pd
import boto3
s3 = boto3.resource('s3')
client=boto3.client("s3")
df = pd.read_parquet("s3://<bucket>/<file>.parquet")
Do this instead — the s3:// scheme is handled by DuckDB's httpfs extension, which has to be installed and loaded before the SET s3_* options take effect:
con.execute("""
INSTALL httpfs;
LOAD httpfs;
SET s3_region='-----';
SET s3_access_key_id='-----';
SET s3_secret_access_key='-----';
""")
import duckdb

def query_s3_duckdb(bucket_name, object_name):
    try:
        # Connect to DuckDB and load the Parquet file from S3
        conn = duckdb.connect()
        conn.execute("INSTALL httpfs")
        conn.execute("LOAD httpfs")
        conn.execute("SET s3_region='us-east-1'")
        conn.execute("SET s3_access_key_id='xxxxxxxxxxxxxxxxx'")
        conn.execute("SET s3_secret_access_key='xxxxxxxxxxxxxxxxxx'")
        conn.execute(f"CREATE TABLE data AS SELECT * FROM read_parquet('s3://{bucket_name}/{object_name}')")
        # Query the data
        result = conn.execute("SELECT COUNT(*) FROM data")
        print(f"Number of rows in data: {result.fetchone()[0]}")
        # Cleanup
        conn.close()
    # S3Error in the original is undefined here; duckdb.Error is the
    # base class of the exceptions DuckDB's Python API raises
    except duckdb.Error as e:
        print(f"Error querying data from S3: {e}")
This should work!!
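As an aside: on newer DuckDB versions (0.10+), the SET s3_* statements can be replaced by the secrets manager. Assuming the aws extension is available, a CREDENTIAL_CHAIN secret picks up credentials the same way boto3 does (env vars, ~/.aws/credentials, instance profile), so no keys need to be pasted into the notebook — syntax per the httpfs/aws extension docs:

```sql
INSTALL aws;
LOAD aws;
LOAD httpfs;
-- CREDENTIAL_CHAIN resolves AWS credentials automatically,
-- the same way boto3 does.
CREATE SECRET my_s3_secret (
    TYPE S3,
    PROVIDER CREDENTIAL_CHAIN
);
```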