
Spark dataframe CSV vs Parquet

I am a beginner in Spark and trying to understand the mechanics of Spark DataFrames. I am comparing the performance of SQL queries on a Spark SQL DataFrame when the data is loaded from CSV versus Parquet. My understanding is that once the data is loaded into a Spark DataFrame, it shouldn't matter where the data was sourced from (CSV or Parquet). However, I see a significant performance difference between the two. I am loading the data using the following commands and then writing queries against it.

dataframe_csv = sqlcontext.read.format("csv").load()

dataframe_parquet = sqlcontext.read.parquet()

Please explain the reason for the difference.

dataapp asked Nov 21 '25 17:11


1 Answer

The performance difference you see between CSV and Parquet comes from how each format stores data on disk. First, note that loading is lazy: `read.format("csv").load(...)` and `read.parquet(...)` don't materialize the data into memory, so Spark still goes back to the source files when a query runs, and the source format matters at query time. Parquet is a binary, columnar format: it embeds the schema, stores each column contiguously and compressed, and lets Spark read only the columns a query touches and skip row groups via predicate pushdown. CSV is row-oriented plain text: Spark must read and parse every field of every line, and either infer the schema (an extra pass) or be given one, even when the query needs just a single column. As a result, queries against the Parquet-backed DataFrame are typically faster, and the files are also smaller on disk.
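To make the columnar advantage concrete, here is a toy sketch in plain Python (no Spark; the file layout, column names, and values are illustrative only — this is not Parquet's actual encoding): a row-oriented CSV must be parsed field by field even for a single-column query, while a columnar layout can load just the needed column.

```python
import csv
import json
import os
import tempfile

# Hypothetical toy dataset (names and values are illustrative only).
rows = [{"id": i, "name": f"user{i}", "score": i * 2} for i in range(5)]

tmp = tempfile.mkdtemp()
csv_path = os.path.join(tmp, "data.csv")
col_path = os.path.join(tmp, "data_columnar.json")

# Row-oriented storage (like CSV): each record is one text line.
with open(csv_path, "w", newline="") as f:
    w = csv.DictWriter(f, fieldnames=["id", "name", "score"])
    w.writeheader()
    w.writerows(rows)

# Columnar storage (sketch): each column stored contiguously,
# which is the idea behind Parquet (here simplified to JSON).
columns = {k: [r[k] for r in rows] for k in ["id", "name", "score"]}
with open(col_path, "w") as f:
    json.dump(columns, f)

# Query "sum of score" against the row format: every field of
# every row gets parsed, even though only one column is needed.
with open(csv_path) as f:
    csv_sum = sum(int(r["score"]) for r in csv.DictReader(f))

# The columnar reader pulls out just the 'score' column.
with open(col_path) as f:
    col_sum = sum(json.load(f)["score"])

print(csv_sum, col_sum)  # both 20
```

Real Parquet adds compression, statistics, and predicate pushdown on top of this layout, which widens the gap further for analytical queries.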

Alejandro Sánchez Muñoz answered Nov 24 '25 22:11


