
Use directories for partition pruning in Spark SQL

I have data files (JSON in this example, but they could also be Avro) written in a directory structure like:

dataroot
+-- year=2015
    +-- month=06
        +-- day=01
            +-- data1.json
            +-- data2.json
            +-- data3.json
        +-- day=02
            +-- data1.json
            +-- data2.json
            +-- data3.json
    +-- month=07
        +-- day=20
            +-- data1.json
            +-- data2.json
            +-- data3.json
        +-- day=21
            +-- data1.json
            +-- data2.json
            +-- data3.json
        +-- day=22
            +-- data1.json
            +-- data2.json

Using spark-sql, I create a temporary table:

CREATE TEMPORARY TABLE dataTable
USING org.apache.spark.sql.json
OPTIONS (
  path "dataroot/*"
)

Querying the table works fine, but so far I haven't been able to use the directories for pruning.

Is there a way to register the directory structure as partitions (without using Hive) so that the whole tree isn't scanned when I query? Say I want to compare data for the first day of every month and only read the directories for those days, along the lines of the query sketched below.
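
For illustration, this is the kind of query I'd like to run (hypothetical, since year, month and day are not actually exposed as columns by the json source above), reading only the day=01 directories:

SELECT year, month, count(*)
FROM dataTable
WHERE day = 1
GROUP BY year, month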

With Apache Drill I can use directories as predicates at query time via dir0 etc. Is it possible to do something similar with Spark SQL?
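
For reference, a sketch of the Drill equivalent, where dir0, dir1 and dir2 are Drill's implicit columns for the directory levels under the queried path (assuming a dfs storage plugin pointing at the data):

SELECT *
FROM dfs.`dataroot`
WHERE dir2 = 'day=01'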

asked by Lundahl


1 Answer

As far as I know, partition auto-discovery only works for Parquet files in Spark SQL. See http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery
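
So if you can write the data as Parquet under the same year=/month=/day= layout, Spark should discover year, month and day as partition columns and skip the directories your filters rule out. A minimal sketch of what that could look like (parquetTable is a name I made up; the mechanics mirror your json table):

CREATE TEMPORARY TABLE parquetTable
USING org.apache.spark.sql.parquet
OPTIONS (
  path "dataroot"
)

-- only the day=01 directories should be scanned
SELECT year, month, count(*)
FROM parquetTable
WHERE day = 1
GROUP BY year, month

Note that the path points at dataroot itself rather than dataroot/*, so Spark can see the partition directories below it.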

answered by Arnon Rotem-Gal-Oz