Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Reading DataFrame from partitioned parquet file

How to read partitioned parquet with condition as dataframe,

this works fine,

val dataframe = sqlContext.read.parquet("file:///home/msoproj/dev_data/dev_output/aln/partitions/data=jDD/year=2015/month=10/day=25/*") 

Partition is there for day=1 to day=30 is it possible to read something like(day = 5 to 6) or day=5,day=6,

val dataframe = sqlContext.read.parquet("file:///home/msoproj/dev_data/dev_output/aln/partitions/data=jDD/year=2015/month=10/day=??/*") 

If I put * it gives me all 30 days data and it too big.

like image 237
WoodChopper Avatar asked Nov 11 '15 12:11

WoodChopper


People also ask

Are Parquet files partitioned?

As a reminder, Parquet files are partitioned. When we say “Parquet file”, we are actually referring to multiple physical files, each of them being a partition. This directory structure makes it easy to add new data every day, but it only works well when you make time-based analysis.


1 Answers

sqlContext.read.parquet can take multiple paths as input. If you want just day=5 and day=6, you can simply add two paths like:

val dataframe = sqlContext       .read.parquet("file:///your/path/data=jDD/year=2015/month=10/day=5/",                      "file:///your/path/data=jDD/year=2015/month=10/day=6/") 

If you have folders under day=X, like say country=XX, country will automatically be added as a column in the dataframe.

EDIT: As of Spark 1.6 one needs to provide a "basepath"-option in order for Spark to generate columns automatically. In Spark 1.6.x the above would have to be re-written like this to create a dataframe with the columns "data", "year", "month" and "day":

val dataframe = sqlContext      .read      .option("basePath", "file:///your/path/")      .parquet("file:///your/path/data=jDD/year=2015/month=10/day=5/",                      "file:///your/path/data=jDD/year=2015/month=10/day=6/") 
like image 195
Glennie Helles Sindholt Avatar answered Sep 21 '22 21:09

Glennie Helles Sindholt