My dataset is partitioned in this way:
Year=yyyy
|---Month=mm
| |---Day=dd
| | |---<parquet-files>
What is the easiest and most efficient way to create a DataFrame in Spark loaded with data between two dates?
Range partitioning in Apache Spark: with this method, tuples whose keys fall within the same range end up on the same machine. A range partitioner partitions keys based on an ordering of the keys and on the chosen set of sorted key ranges.
Spark automatically partitions RDDs and distributes the partitions across different nodes. A partition in Spark is an atomic chunk of data (a logical division of the data) stored on a node in the cluster. Partitions are the basic units of parallelism in Apache Spark, and RDDs are collections of partitions.
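A quick way to see this in action (a minimal sketch, assuming a SparkSession named spark is in scope; names are just examples):
import org.apache.spark.sql.functions.col
// Number of partitions Spark chose automatically for this Dataset
val nums = spark.range(0, 1000000).toDF("id")
println(nums.rdd.getNumPartitions)
// Range-partition by key ordering: ids in the same range land in the same partition
val ranged = nums.repartitionByRange(8, col("id"))
println(ranged.rdd.getNumPartitions) // 8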
date_format() is a date function that returns a date in a specified format. The Spark SQL functions package has to be imported to use the date functions; Seq() takes the date 2021-02-14 as input, and current_date() returns the current date.
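As a rough illustration (column names and format pattern are just examples, assuming spark.implicits._ is imported):
import org.apache.spark.sql.functions.{current_date, date_format, col}
import spark.implicits._
val dates = Seq("2021-02-14").toDF("input_date")
dates.select(
  date_format(col("input_date"), "MM/dd/yyyy").as("formatted"),
  current_date().as("today")
).show()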
Spark/PySpark partitioning is a way to split the data into multiple partitions so that transformations can run on those partitions in parallel, which completes the job faster. You can also write partitioned data into a file system (as multiple sub-directories) for faster reads by downstream systems.
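For instance, a layout like the one in the question is usually produced with partitionBy (a sketch only; df is a hypothetical DataFrame that already has Year, Month and Day columns):
import org.apache.spark.sql.SaveMode
// Writes one sub-directory per distinct (Year, Month, Day) combination
df.write
  .mode(SaveMode.Append)
  .partitionBy("Year", "Month", "Day")
  .parquet("hdfs:///basepath")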
If you absolutely have to stick to this partitioning strategy, the answer depends on whether you are willing to bear partition discovery costs or not.
If you are willing to have Spark discover all partitions, which only needs to happen once (until you add new files), you can load the basepath and then filter using the partition columns.
If you do not want Spark to discover all the partitions, e.g., because you have millions of files, the only efficient general solution is to break the interval you want to query into several sub-intervals you can easily query using @r0bb23's approach and then union them together.
If you want the best of both cases above and you have a stable schema, you can register the partitions in the metastore by defining an external partitioned table. Don't do this if you expect your schema to evolve as metastore-managed tables manage schema evolution quite poorly at this time.
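If you go the metastore route, a minimal sketch could look like the following (assuming Hive metastore support is enabled; the table name and data column are made up, so adjust them to your schema):
// Register the existing directory layout as an external partitioned table
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS events (value STRING)
  PARTITIONED BY (Year INT, Month INT, Day INT)
  STORED AS PARQUET
  LOCATION 'hdfs:///basepath'
""")
// Have the metastore pick up the existing Year=/Month=/Day= directories
spark.sql("MSCK REPAIR TABLE events")
// Partition pruning now goes through the metastore instead of file listing
val events = spark.sql(
  "SELECT * FROM events WHERE Year = 2017 AND Month = 10 AND Day >= 6")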
For example, to query between 2017-10-06 and 2017-11-03 you'd do:
// With full discovery
spark.read.parquet("hdfs:///basepath")
.where('Year === 2017 && (
('Month === 10 && 'Day >= 6) || ('Month === 11 && 'Day <= 3)
))
// With partial discovery
val df1 = spark.read.option("basePath", "hdfs:///basepath/")
.parquet("hdfs:///basepath/Year=2017/Month=10/Day={0[6-9], [1-3][0-9]}/*/")
val df2 = spark.read.option("basePath", "hdfs:///basepath/")
.parquet("hdfs:///basepath/Year=2017/Month=11/Day={0[1-3]}/*/")
val df = df1.union(df2)
Writing generic code for this is certainly possible but I haven't encountered it. The better approach is to partition in the manner outlined in the comment I made to the question. If your table was partitioned using something like /basepath/ts=yyyymmddhhmm/*.parquet
then the answer is simply:
spark.read.parquet("hdfs:///basepath")
.where('ts >= 201710060000L && 'ts <= 201711030000L)
The reason it's worth adding hours & minutes is that you can then write generic code that handles intervals regardless of whether you have the data partitioned by week, day, hour, or every 15 minutes. In fact, you can even manage data at different granularities in the same table, e.g., older data aggregated at higher levels to reduce the total number of partitions that need to be discovered.
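To sketch what such generic code could look like (the helper name and the yyyymmddhhmm assumption are mine, not from the answer):
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
// Filters a ts=yyyymmddhhmm partitioned table to an arbitrary date-time interval
def betweenTs(df: DataFrame, from: LocalDateTime, to: LocalDateTime): DataFrame = {
  val fmt = DateTimeFormatter.ofPattern("yyyyMMddHHmm")
  df.where(col("ts") >= from.format(fmt).toLong && col("ts") <= to.format(fmt).toLong)
}
val df = betweenTs(
  spark.read.parquet("hdfs:///basepath"),
  LocalDateTime.of(2017, 10, 6, 0, 0),
  LocalDateTime.of(2017, 11, 3, 0, 0))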
Edited to add multiple load paths to address comment.
You can use a regex style syntax.
val dataset = spark
.read
.format("parquet")
.option("filterPushdown", "true")
.option("basePath", "hdfs:///basepath/")
.load("hdfs:///basepath/Year=2017/Month=10/Day={0[6-9],[1-3][0-9]}/*/",
"hdfs:///basepath/Year=2017/Month=11/Day={0[1-3]}/*/")
See also: How to use regex to include/exclude some input files in sc.textFile?
Note: you don't need the X=*; you can just use * if you want all days, months, etc.
You should probably also do some reading about Predicate Pushdown (i.e., filterPushdown set to true above).
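One quick way to sanity-check what actually gets pruned and pushed down (a sketch reusing the dataset from above):
import org.apache.spark.sql.functions.col
// Look for PartitionFilters / PushedFilters in the physical plan
dataset
  .where(col("Year") === 2017 && col("Month") === 10)
  .explain(true)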
Finally, you will notice the basePath option above; the reason for it can be found here: Prevent DataFrame.partitionBy() from removing partitioned columns from schema