I'm new to Big Data. I need to convert a CSV/TXT file to Parquet format. I searched a lot but couldn't find any direct way to do so. Is there any way to achieve that?
I already posted an answer on how to do this using Apache Drill. However, if you are familiar with Python, you can now do this using Pandas and PyArrow!
Using pip:
pip install pandas pyarrow
or using conda:
conda install pandas pyarrow -c conda-forge
# csv_to_parquet.py
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

csv_file = '/path/to/my.tsv'
parquet_file = '/path/to/my.parquet'
chunksize = 100_000

csv_stream = pd.read_csv(csv_file, sep='\t', chunksize=chunksize, low_memory=False)

for i, chunk in enumerate(csv_stream):
    print("Chunk", i)
    if i == 0:
        # Guess the schema of the CSV file from the first chunk
        parquet_schema = pa.Table.from_pandas(df=chunk).schema
        # Open a Parquet file for writing
        parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema, compression='snappy')
    # Write CSV chunk to the parquet file
    table = pa.Table.from_pandas(chunk, schema=parquet_schema)
    parquet_writer.write_table(table)

parquet_writer.close()
I haven't benchmarked this code against the Apache Drill version, but in my experience it's plenty fast, converting tens of thousands of rows per second (this depends on the CSV file of course!).
Edit:
We can now read CSV files directly into PyArrow Tables using pyarrow.csv.read_csv. This is probably faster than using the Pandas CSV reader, although it may be less flexible. A sketch of that approach follows.
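Here is a minimal sketch of the pyarrow.csv approach. The file paths are placeholders, and the tab-delimiter hint in the comment is my assumption about how you would adapt it for a TSV input.

# csv_to_parquet_arrow.py
import pyarrow.csv as pv
import pyarrow.parquet as pq

# Read the whole file into an Arrow Table in one go.
# For tab-separated input, pass parse_options=pv.ParseOptions(delimiter='\t').
table = pv.read_csv('/path/to/my.csv')

# Write the table out as a snappy-compressed Parquet file.
pq.write_table(table, '/path/to/my.parquet', compression='snappy')

Note that this reads the entire file into memory at once, so for very large files the chunked Pandas version above may still be the safer choice.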