 

Python Pandas to convert CSV to Parquet using Fastparquet

I am using the Python 3.6 interpreter in my PyCharm venv and am trying to convert a CSV file to Parquet.

import pandas as pd
df = pd.read_csv('/parquet/drivers.csv')
df.to_parquet('output.parquet')

Error-1 ImportError: Unable to find a usable engine; tried using: 'pyarrow', 'fastparquet'. pyarrow or fastparquet is required for parquet support

Solution-1 Installed fastparquet 0.2.1

Error-2

  File "/Users/python parquet/venv/lib/python3.6/site-packages/fastparquet/compression.py", line 131, in compress_data
    (algorithm, sorted(compressions)))
RuntimeError: Compression 'snappy' not available. Options: ['GZIP', 'UNCOMPRESSED']

I installed python-snappy 0.5.3, but I am still getting the same error. Do I need to install any other library?

If I use the PyArrow 0.12.0 engine, I don't experience the issue.
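
(By "the PyArrow engine" I mean selecting it explicitly, roughly like this:)

import pandas as pd

df = pd.read_csv('/parquet/drivers.csv')
# Forcing the engine instead of letting pandas auto-detect one works fine:
df.to_parquet('output.parquet', engine='pyarrow')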

1 Answer

In fastparquet, snappy compression is an optional feature.
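
This matters because pandas' DataFrame.to_parquet defaults to compression="snappy", so a plain df.to_parquet('output.parquet') asks fastparquet for the one codec it does not have. A minimal sketch of the workaround, assuming fastparquet is the engine pandas picks:

import pandas as pd

df = pd.read_csv('/parquet/drivers.csv')
# The default compression="snappy" raises the RuntimeError above,
# so request a codec that fastparquet always ships with.
df.to_parquet('output.parquet', engine='fastparquet', compression='GZIP')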

To quickly check a CSV-to-Parquet conversion, you can execute the following script (it only requires pandas and fastparquet):

import pandas as pd
from fastparquet import ParquetFile  # used to read the Parquet file back

# Build a small DataFrame and round-trip it through CSV first.
df = pd.DataFrame({"col1": [1, 2, 3, 4], "col2": ["a", "b", "c", "d"]})
df.head()  # test your initial value
df.to_csv("/tmp/test_csv", index=False)
df_csv = pd.read_csv("/tmp/test_csv")
df_csv.head()  # test your intermediate value

# Write Parquet with gzip compression (always available in fastparquet),
# then read it back with fastparquet and compare.
df_csv.to_parquet("/tmp/test_parquet", compression="GZIP")
df_parquet = ParquetFile("/tmp/test_parquet").to_pandas()
df_parquet.head()  # test your final value

However, if you need to write or read using snappy compression, you might follow this answer about installing the snappy library on Ubuntu.
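
Once the system snappy library and the python-snappy bindings are both in place, a quick sanity check along these lines should pass (a minimal sketch, assuming python-snappy imports as snappy and using an illustrative /tmp path), and snappy-compressed Parquet output should then work:

import snappy  # provided by the python-snappy package
import pandas as pd

# If the native snappy library is missing, this round trip fails even
# though pip reported a successful install of python-snappy.
payload = snappy.compress(b"hello parquet")
assert snappy.uncompress(payload) == b"hello parquet"

# With working bindings, snappy compression (pandas' default) succeeds.
df = pd.DataFrame({"col1": [1, 2, 3, 4]})
df.to_parquet("/tmp/test_snappy_parquet", engine="fastparquet", compression="snappy")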
