
Python - read parquet file without pandas

Currently I'm using the code below, on Python 3.5 on Windows, to read in a Parquet file.

import pandas as pd

parquetfilename = 'File1.parquet'
parquetFile = pd.read_parquet(parquetfilename, columns=['column1', 'column2'])  

However, I'd like to do this without using pandas. What is the best way to do it? I'm using both Python 2.7 and 3.6 on Windows.

inquisitiveProgrammer asked Jun 22 '18 12:06


1 Answer

You can use DuckDB for this. It's an embedded RDBMS similar to SQLite, but designed with OLAP workloads in mind. It has a nice Python API and a SQL function for importing Parquet files:

import duckdb

conn = duckdb.connect(":memory:") # or a file name to persist the DB

# Keep in mind this doesn't support partitioned datasets,
# so you can only read one partition at a time
conn.execute("CREATE TABLE mydata AS SELECT * FROM parquet_scan('/path/to/mydata.parquet')")

# Export a query as CSV
conn.execute("COPY (SELECT * FROM mydata WHERE col = 'val') TO 'col_val.csv' WITH (HEADER 1, DELIMITER ',')")

Edgar Ramírez Mondragón answered Oct 04 '22 20:10