How can I read partitioned Parquet files with a condition into a DataFrame?

This works fine:

val dataframe = sqlContext.read.parquet("file:///home/msoproj/dev_data/dev_output/aln/partitions/data=jDD/year=2015/month=10/day=25/*")

Partitions exist for day=1 to day=30. Is it possible to read something like (day = 5 to 6) or day=5,day=6?
val dataframe = sqlContext.read.parquet("file:///home/msoproj/dev_data/dev_output/aln/partitions/data=jDD/year=2015/month=10/day=??/*")

If I put *, it gives me all 30 days of data, and that is too big.
As a reminder, Parquet data is partitioned: what we call a "Parquet file" is actually a directory of physical files, one per partition. This directory structure makes it easy to add new data every day, but it only works well for time-based analysis.
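Based on the paths in the question, the layout looks roughly like this (abbreviated):

```
data=jDD/
└── year=2015/
    └── month=10/
        ├── day=1/
        │   └── part-*.parquet
        ├── ...
        └── day=30/
            └── part-*.parquet
```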
sqlContext.read.parquet can take multiple paths as input. If you want just day=5 and day=6, you can simply pass two paths:

val dataframe = sqlContext.read.parquet(
  "file:///your/path/data=jDD/year=2015/month=10/day=5/",
  "file:///your/path/data=jDD/year=2015/month=10/day=6/")
If you have folders under day=X, say country=XX, then country will automatically be added as a column in the DataFrame.
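For example (a sketch; the country=XX subfolders are hypothetical, not part of the asker's layout), reading two day partitions that each contain country subdirectories would surface the nested key as an ordinary column:

```scala
// Assumed layout: .../day=5/country=US/part-*.parquet, .../day=5/country=FR/..., etc.
val df = sqlContext.read.parquet(
  "file:///your/path/data=jDD/year=2015/month=10/day=5/",
  "file:///your/path/data=jDD/year=2015/month=10/day=6/")

// Partition discovery below the given paths adds `country` as a column:
df.printSchema()
df.select("country").distinct().show()
```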
EDIT: As of Spark 1.6, one needs to provide the "basePath" option for Spark to generate the partition columns automatically. In Spark 1.6.x the above would have to be rewritten like this to create a DataFrame with the columns "data", "year", "month" and "day":
val dataframe = sqlContext.read
  .option("basePath", "file:///your/path/")
  .parquet(
    "file:///your/path/data=jDD/year=2015/month=10/day=5/",
    "file:///your/path/data=jDD/year=2015/month=10/day=6/")
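An alternative, if you prefer not to enumerate paths, is to read from the base path and filter on the partition columns. Spark uses the directory layout to prune partitions, so only the matching day=X directories should be scanned (a sketch assuming the same layout as above):

```scala
// Read the whole partitioned dataset, then filter on partition columns;
// the filter is pushed down to partition pruning, not a full scan.
val dataframe = sqlContext.read
  .parquet("file:///your/path/data=jDD/")
  .filter("year = 2015 AND month = 10 AND day BETWEEN 5 AND 6")
```

This also generalizes naturally to ranges (day = 5 to 25, say) where listing every path would be tedious.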