How to More Efficiently Load Parquet Files in Spark (pySpark v1.2.0)

I'm loading in high-dimensional parquet files but only need a few columns. My current code looks like:

dat = sqc.parquetFile(path) \
          .filter(lambda r: len(r.a)>0) \
          .map(lambda r: (r.a, r.b, r.c))

My mental model of what's happening is that it's loading in all the data, then throwing out the columns I don't want. I'd obviously prefer it to not even read in those columns, and from what I understand about parquet that seems to be possible.

So there are two questions:

  1. Is my mental model wrong? Or is the Spark compiler smart enough to only read in columns a, b, and c in the example above?
  2. How can I force sqc.parquetFile() to read in data more efficiently?
asked Apr 22 '15 by jarfa


1 Answer

You should use the Spark DataFrame API: https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#dataframe-operations

Something like

dat.select("a", "b", "c").filter("length(a) > 0")
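
With select(), the column pruning happens in the Parquet reader itself, so the columns you don't ask for are never read off disk; Spark can't look inside a Python lambda, which is why the original filter/map version has to materialize whole rows first. A minimal end-to-end sketch, assuming Spark 1.3+ (where parquetFile returns a DataFrame), that sqc is a SQLContext, and that the SQL length function used in this answer is available in your context; the path and column names are placeholders taken from the question:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="parquet-column-pruning")
sqc = SQLContext(sc)

path = "/data/wide_table.parquet"  # hypothetical location of the wide Parquet data

# Only the Parquet footers/schema are touched here; row data is read lazily.
dat = sqc.parquetFile(path)

# select() restricts the scan to columns a, b and c, and filter() takes a
# SQL expression string (or a Column), not a Python lambda.
result = dat.select("a", "b", "c").filter("length(a) > 0")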

Or you can use Spark SQL:

dat.registerTempTable("dat")
sqc.sql("select a, b, c from dat where length(a) > 0")
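
Either way, you can check that the pruning actually happens by printing the query plan; a quick sketch under the same assumptions as above, with dat registered as a temp table named "dat":

pruned = sqc.sql("select a, b, c from dat where length(a) > 0")
# The physical plan should show a Parquet scan that lists only columns
# a, b and c, confirming that the remaining columns are never read.
pruned.explain(True)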
answered Sep 17 '22 by kostya