I have an application that sends data to AWS Kinesis Firehose, which writes the data into my S3 bucket. Firehose uses the "yyyy/MM/dd/HH" format to write the files.
Like in this sample S3 path:
s3://mybucket/2016/07/29/12
Now I have a Spark application written in Scala, where I need to read data from a specific time period. I have start and end dates. The data is in JSON format, which is why I use sqlContext.read.json() rather than sc.textFile().
How can I read the data quickly and efficiently? I've considered:
Wildcards - I can select the data from all hours of a specific date, or from all dates of a specific month, for example:
val df = sqlContext.read.json("s3://mybucket/2016/07/29/*")
val df = sqlContext.read.json("s3://mybucket/2016/07/*/*")
But if I have to read data from a period of just a few days, for example 2016-07-29 - 2016-07-30, I cannot use the wildcard approach in the same way.
Which brings me to my next point... Comma-separated paths - passing several directories as one comma-separated string works only with sc.textFile() and not sqlContext.read.json().
. Union - A second solution from the previous link by cloud suggests to read each directory separately and then union them together. Although he suggests unioning RDD-s, there's an option to union DataFrames as well. If I generate the date strings from given date period manually, then I may create a path that does not exist and instead of ignoring it, the whole reading fails. Instead I could use AWS SDK and use the function listObjects
from AmazonS3Client to get all the keys like in iMKanchwala's solution from the previous link.
The only problem is that my data is constantly changing. If read.json()
function gets all the data as a single parameter, it reads all the necessary data and is smart enough to infer the json schema from the data. If I read 2 directories separately and their schemas don't match, then I think unioning these two dataframes becomes a problem.
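One way to sidestep the missing-directory problem in the union approach is to keep only the day prefixes that actually contain objects before reading. A minimal sketch of the filtering step; the `existingPrefixes` helper is hypothetical, and `existingKeys` stands in for the keys an AmazonS3Client.listObjects call would return (the AWS SDK call itself is assumed, not shown):

```scala
// Sketch: filter candidate day prefixes down to the ones that actually exist.
// existingKeys stands in for the object keys returned by AmazonS3Client.listObjects.
def existingPrefixes(candidates: Seq[String], existingKeys: Seq[String]): Seq[String] =
  candidates.filter(prefix => existingKeys.exists(_.startsWith(prefix)))

val candidates   = Seq("2016/07/24/", "2016/07/25/", "2016/07/26/")
val existingKeys = Seq("2016/07/24/00/file1.json", "2016/07/26/12/file2.json")

existingPrefixes(candidates, existingKeys)
// 2016/07/25/ is dropped, because no listed key starts with it
```

Only the surviving prefixes would then be turned into full s3:// paths and read, so a gap in the data no longer fails the whole job.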
Glob(?) syntax - This solution by nhahtdh is a little better than options 1 and 2, because it makes it possible to specify dates and directories in more detail, as a single "path", so it also works with read.json().
But again, the familiar problem of missing directories comes up. Let's say I want all the data from 20.07 to 30.07; I can declare it like this:
val df = sqlContext.read.json("s3://mybucket/2016/07/[20-30]/*")
But if I am missing data from, let's say, the 25th of July, then the path ..16/07/25/ does not exist and the whole function fails.
And obviously it gets more difficult when the requested period is for example 25.11.2015-12.02.2016, then I would need to programmatically (in my Scala script) create a string path something like this:
"s3://mybucket/{2015/11/[25-30],2015/12/*,2016/01/*,2016/02/[01-12]}/*"
And by creating it, I would need to somehow make sure that the 25-30 and 01-12 intervals all have corresponding paths; if one is missing, the read fails again. (The asterisk fortunately deals with missing directories, as it reads everything that exists.)
How can I read all the necessary data from a single directory path, all at once, without failing because of a missing directory somewhere within the date interval?
val fileSystem = FileSystem.get(URI.create(path), new Configuration())
val it = fileSystem.listFiles(new Path(path), true)
while (it.hasNext()) {
  ...
}
This is the way to do it, and listFiles(path, true) can give you much better performance with S3 than trying to do the treewalk yourself.
There is a much simpler solution. If you look at the DataFrameReader API, you'll notice that there is a .json(paths: String*) method. Just build a collection of the paths you want, with globs or not, as you prefer, and then call the method, e.g.,
val paths: Seq[String] = ...
val df = sqlContext.read.json(paths: _*)
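To connect this to the original date-range problem, here is a minimal sketch of building one wildcard path per day and passing the whole collection at once. The `dailyPaths` helper is a hypothetical name, not from the answer; only the yyyy/MM/dd layout comes from the question:

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Build one "s3://bucket/yyyy/MM/dd/*" glob per day in [start, end].
// dailyPaths is a hypothetical helper, not part of any Spark API.
def dailyPaths(bucket: String, start: LocalDate, end: LocalDate): Seq[String] = {
  val fmt = DateTimeFormatter.ofPattern("yyyy/MM/dd")
  Iterator.iterate(start)(_.plusDays(1))
    .takeWhile(!_.isAfter(end))
    .map(d => s"s3://$bucket/${d.format(fmt)}/*")
    .toSeq
}

val paths = dailyPaths("mybucket", LocalDate.of(2016, 7, 29), LocalDate.of(2016, 7, 30))
// paths == Seq("s3://mybucket/2016/07/29/*", "s3://mybucket/2016/07/30/*")
// val df = sqlContext.read.json(paths: _*)   // then read them all in one call
```

Note that a glob which matches nothing can still fail the read, so for gappy data you may want to filter this list against the prefixes that actually exist (e.g. via the AWS SDK listing, as discussed in the question) before calling json.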