
How to convert a CSV file to Parquet

Tags:

java

parquet

I'm new to big data. I need to convert a CSV/TXT file to Parquet format. I searched a lot but couldn't find any direct way to do so. Is there any way to achieve that?

asked Sep 30 '14 by author243


1 Answer

I already posted an answer on how to do this using Apache Drill. However, if you are familiar with Python, you can now do this using Pandas and PyArrow!

Install dependencies

Using pip:

pip install pandas pyarrow 

or using conda:

conda install pandas pyarrow -c conda-forge 

Convert CSV to Parquet in chunks

# csv_to_parquet.py

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

csv_file = '/path/to/my.tsv'
parquet_file = '/path/to/my.parquet'
chunksize = 100_000

csv_stream = pd.read_csv(csv_file, sep='\t', chunksize=chunksize, low_memory=False)

for i, chunk in enumerate(csv_stream):
    print("Chunk", i)
    if i == 0:
        # Guess the schema of the CSV file from the first chunk
        parquet_schema = pa.Table.from_pandas(df=chunk).schema
        # Open a Parquet file for writing
        parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema, compression='snappy')
    # Write CSV chunk to the parquet file
    table = pa.Table.from_pandas(chunk, schema=parquet_schema)
    parquet_writer.write_table(table)

parquet_writer.close()

I haven't benchmarked this code against the Apache Drill version, but in my experience it's plenty fast, converting tens of thousands of rows per second (this depends on the CSV file of course!).
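
If you want to sanity-check the output, you can read the Parquet file back with pyarrow.parquet.read_table. A quick sketch, reusing the placeholder path from above:

# Quick sanity check: read the Parquet file back and inspect it.
import pyarrow.parquet as pq

table = pq.read_table('/path/to/my.parquet')
print(table.num_rows)
print(table.schema)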


Edit:

We can now read CSV files directly into PyArrow Tables using pyarrow.csv.read_csv. This is probably faster than using the Pandas CSV reader, although it may be less flexible.
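
A minimal sketch of that route (the file paths are placeholders, and there's no chunking here, so the whole file has to fit in memory):

# csv_to_parquet_pyarrow.py
import pyarrow.csv as pv
import pyarrow.parquet as pq

# Read the entire CSV into a pyarrow.Table in one go.
table = pv.read_csv('/path/to/my.csv')

# For a tab-separated file like the one above, pass an explicit delimiter:
# table = pv.read_csv('/path/to/my.tsv', parse_options=pv.ParseOptions(delimiter='\t'))

# Write the table out as Parquet.
pq.write_table(table, '/path/to/my.parquet', compression='snappy')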

answered Sep 20 '22 by ostrokach