I'm new to Big Data. I need to convert a CSV/TXT file to Parquet format. I searched a lot but couldn't find any direct way to do so. Is there any way to achieve that?
I already posted an answer on how to do this using Apache Drill. However, if you are familiar with Python, you can now do this using Pandas and PyArrow!
Using pip:
pip install pandas pyarrow
or using conda:
conda install pandas pyarrow -c conda-forge
# csv_to_parquet.py
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

csv_file = '/path/to/my.tsv'
parquet_file = '/path/to/my.parquet'
chunksize = 100_000

csv_stream = pd.read_csv(csv_file, sep='\t', chunksize=chunksize, low_memory=False)

for i, chunk in enumerate(csv_stream):
    print("Chunk", i)
    if i == 0:
        # Guess the schema of the CSV file from the first chunk
        parquet_schema = pa.Table.from_pandas(df=chunk).schema
        # Open a Parquet file for writing
        parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema, compression='snappy')
    # Write CSV chunk to the parquet file
    table = pa.Table.from_pandas(chunk, schema=parquet_schema)
    parquet_writer.write_table(table)

parquet_writer.close()
I haven't benchmarked this code against the Apache Drill version, but in my experience it's plenty fast, converting tens of thousands of rows per second (this depends on the CSV file of course!).
Edit:
We can now read CSV files directly into PyArrow Tables using pyarrow.csv.read_csv. This is probably faster than using the Pandas CSV reader, although it may be less flexible. A sketch of that approach follows.
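Here is a minimal sketch of the pyarrow.csv approach. The file paths are placeholders, and the tab-delimiter hint in the comment is my assumption about how you would adapt it for a TSV input.

# csv_to_parquet_arrow.py
import pyarrow.csv as pv
import pyarrow.parquet as pq

# Read the whole file into an Arrow Table in one go.
# For tab-separated input, pass parse_options=pv.ParseOptions(delimiter='\t').
table = pv.read_csv('/path/to/my.csv')

# Write the table out as a snappy-compressed Parquet file.
pq.write_table(table, '/path/to/my.parquet', compression='snappy')

Note that this reads the entire file into memory at once, so for very large files the chunked Pandas version above may still be the safer choice.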