I have to load a CSV file from HDFS into a Spark DataFrame. I was wondering if there is a "performance" improvement (query speed) with a DataFrame backed by a Parquet file vs. one backed by a CSV file?
Typically, I load a CSV file into a data frame like the following.
val df1 = sqlContext.read
  .format("com.databricks.spark.csv") // the spark-csv package (built in as "csv" since Spark 2.0)
  .option("header", "true")           // first line holds the column names
  .option("inferSchema", "true")      // infer column types (costs an extra pass over the file)
  .load("hdfs://box/path/to/file.csv")
On the other hand, loading a Parquet file (assuming I've already parsed the CSV file, created a schema, and saved the data to HDFS) looks like the following.
val df2 = sqlContext.read.parquet("hdfs://box/path/to/file.parquet")
Now I'm wondering whether the query times for operations like the ones below would be impacted and/or differ between the two.
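For illustration, these are the kinds of queries I have in mind (a rough sketch; the "year" column is a made-up name, not from my actual data):

df1.filter(df1("year") === 2016).count() // CSV-backed
df2.filter(df2("year") === 2016).count() // Parquet-backed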
Does anyone know if there is predicate pushdown for Parquet?
To me, it seems Parquet is somewhat like an inverted index, and I would expect simple filter-then-count queries to be faster for a data frame based on Parquet than for one based on CSV. With the CSV-backed data frame, I would imagine a full scan of the data set has to occur every time we filter for items.
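One way I know of to check for pushdown (if I understand correctly) is to inspect the physical plan: for a Parquet scan, Spark should list the filters it managed to push down. A sketch, again with a made-up "amount" column:

// For the Parquet-backed frame, the physical plan should show a
// "PushedFilters" entry, e.g. PushedFilters: [GreaterThan(amount,100)].
df2.filter(df2("amount") > 100).explain()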
Any clarification on the query performance of CSV- vs. Parquet-backed data frames is appreciated. Pointers to any other file format that helps speed up query counts in data frames are also welcome.
CSV is a row-oriented format, while Parquet is a column-oriented format.
Typically, row-oriented formats are more efficient for queries that either must access most of the columns or read only a fraction of the rows. Column-oriented formats, on the other hand, are usually more efficient for queries that need to read most of the rows but only have to access a fraction of the columns. Analytical queries typically fall into the latter category, while transactional queries more often fall into the former.
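For example, a query that touches a single column can skip all other columns entirely in a column-oriented format, while a row-oriented format still has to read every row in full. A minimal sketch against the question's df2 (the "amount" column is hypothetical; sum comes from org.apache.spark.sql.functions):

import org.apache.spark.sql.functions.sum

// Parquet only reads the data for the "amount" column; a CSV scan
// would still have to read and parse every field of every row.
df2.select(sum("amount")).show()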
Additionally, CSV is a text-based format, which cannot be parsed as efficiently as a binary format; this makes CSV even slower. A typical column-oriented format, on the other hand, is not only binary but also allows more efficient compression, which leads to smaller disk usage and faster access. I recommend reading the Introduction section of The Design and Implementation of Modern Column-Oriented Database Systems.
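In Spark you can pick the Parquet compression codec when writing; a minimal sketch using the standard configuration key (Snappy is a common default, trading a little compression ratio for fast decompression):

// Use Snappy compression for Parquet output.
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")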
Since the Hadoop ecosystem is geared toward analytical queries, Parquet is generally a better choice for performance than CSV in Hadoop applications.
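In practice this usually means converting the CSV once and pointing all subsequent queries at the Parquet copy. A minimal sketch reusing the question's df1 and paths:

// One-time conversion: write the CSV-backed frame out as Parquet ...
df1.write.parquet("hdfs://box/path/to/file.parquet")
// ... then run all further queries against the Parquet copy.
val df2 = sqlContext.read.parquet("hdfs://box/path/to/file.parquet")
df2.count()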