 

Is querying against a Spark DataFrame based on CSV faster than one based on Parquet?

I have to load a CSV file from HDFS into a Spark DataFrame. I was wondering whether there is a "performance" improvement (query speed) for a DataFrame backed by a CSV file vs. one backed by a Parquet file.

Typically, I load a CSV file like the following into a data frame.

val df1 = sqlContext.read
  .format("com.databricks.spark.csv") // the spark-csv package (built in as "csv" from Spark 2.0)
  .option("header", "true")           // first line holds the column names
  .option("inferSchema", "true")      // extra pass over the data to infer column types
  .load("hdfs://box/path/to/file.csv")

On the other hand, loading a Parquet file (assuming I've parsed the CSV file, created a schema, and saved it to HDFS) looks like the following.

val df2 = sqlContext.read.parquet("hdfs://box/path/to/file.parquet")
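
For reference, a minimal sketch of the conversion step assumed above, using the DataFrame writer API (available since Spark 1.4) and the question's hypothetical paths:

// One-time conversion: persist the CSV-derived data frame as Parquet.
// The schema travels with the Parquet files (in the footers), so no
// inference pass is needed when reading it back as df2.
df1.write.parquet("hdfs://box/path/to/file.parquet")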

Now I'm wondering whether the query times of operations like the following would be impacted and/or differ.

  • df1.where("col1='some1'").count()
  • df1.where("col1='some1' and col2='some2'").count()

Does anyone know whether there is predicate pushdown for Parquet?

To me, Parquet seems somewhat like an inverted index, so I would expect simple count-with-filter queries to be faster on a Parquet-backed data frame than on a CSV-backed one. With a CSV-backed data frame, I imagine a full scan of the data set would have to occur each time we filter for items.

Any clarification on query performance for CSV- vs. Parquet-backed data frames is appreciated. Any file format that helps speed up query counts on data frames is also welcome.

asked Sep 16 '16 by Jane Wayne

1 Answer

CSV is a row-oriented format, while Parquet is a column-oriented format.

Typically, row-oriented formats are more efficient for queries that either must access most of the columns or read only a fraction of the rows. Column-oriented formats, on the other hand, are usually more efficient for queries that need to read most of the rows but only have to access a fraction of the columns. Analytical queries typically fall into the latter category, while transactional queries more often fall into the former.
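
As an aside on the predicate-pushdown part of the question: a quick way to check is to inspect the physical plan with explain(). This is a sketch against the question's df1/df2; the exact plan output varies by Spark version:

// Inspect the physical plan for the Parquet-backed data frame.
df2.where("col1 = 'some1'").explain()
// A Parquet scan usually reports the filter, e.g.:
//   PushedFilters: [EqualTo(col1,some1)]
// The same query over the CSV source shows no pushed filters: every
// line is read and parsed, and the filter is applied afterwards.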

Additionally, CSV is a text-based format, which cannot be parsed as efficiently as a binary format; this makes CSV even slower. A typical column-oriented format, on the other hand, is not only binary but also allows more efficient compression, which leads to smaller disk usage and faster access. I recommend reading the Introduction section of The Design and Implementation of Modern Column-Oriented Database Systems.
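
For illustration, the compression codec used when writing Parquet can be chosen through a Spark SQL setting; snappy (a common default) and gzip are typical choices. This is a sketch, not a tuning recommendation:

// Trade CPU for I/O: pick the Parquet compression codec before writing.
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
df1.write.parquet("hdfs://box/path/to/file.parquet")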

Since workloads in the Hadoop ecosystem are typically analytical, Parquet is generally a better choice than CSV for the performance of Hadoop applications.

answered Oct 10 '22 by Zoltan