I'm loading in high-dimensional parquet files but only need a few columns. My current code looks like:
dat = sqc.parquetFile(path) \
.filter(lambda r: len(r.a)>0) \
.map(lambda r: (r.a, r.b, r.c))
My mental model of what's happening is that it's loading in all the data, then throwing out the columns I don't want. I'd obviously prefer it to not even read in those columns, and from what I understand about parquet that seems to be possible.
So there are two questions:
1. Is my mental model correct?
2. Is there a way to tell sqc.parquetFile() to read in data more efficiently?
You should use the Spark DataFrame API: https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#dataframe-operations
Something like
dat.select("a", "b", "c").filter("length(a) > 0")  # DataFrame.filter takes a Column or a SQL expression string, not a lambda
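As a minimal end-to-end sketch (assuming sqc is a SQLContext on Spark 1.3+ and path points at your Parquet data, as in the question): because select() names only a, b, and c, Spark should push the column pruning down to the Parquet scan, so the other columns are not read.
dat = sqc.parquetFile(path)                                  # DataFrame backed by the Parquet files
pruned = dat.select("a", "b", "c").filter("length(a) > 0")   # only these three columns are scanned
pruned.show()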
Or you can use Spark SQL:
dat.registerTempTable("dat")
sqc.sql("select a, b, c from dat where length(a) > 0")
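If you want to check that the pruning actually happens, one way (a sketch, reusing the dat and sqc names above) is to look at the physical plan; the Parquet scan it prints should mention only columns a, b, and c.
dat.registerTempTable("dat")
result = sqc.sql("select a, b, c from dat where length(a) > 0")
result.explain()  # prints the physical plan; the Parquet scan should list only a, b, c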