 

Spark SQL: How to consume json data from a REST service as DataFrame

I need to read some JSON data from a web service that provides REST interfaces, so I can query the data from my Spark SQL code for analysis. I am able to read a JSON file stored in the blob store and use it.

I was wondering what the best way is to read the data from a REST service and use it like any other DataFrame.

BTW, I am using Spark 1.6 on a Linux cluster on HDInsight, if that helps. I would also appreciate it if someone could share some code snippets, as I am still very new to the Spark environment.

asked May 09 '16 by Kiran
1 Answer

Spark cannot parse arbitrary JSON into a DataFrame, because JSON is a hierarchical structure while a DataFrame is flat. If your JSON was not created by Spark, chances are it does not satisfy the condition "Each line must contain a separate, self-contained valid JSON object", and hence will need to be parsed by your own code and then fed to a DataFrame as a collection of case-class objects or Spark SQL Rows.
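For reference, this is the shape Spark's built-in JSON reader expects: one self-contained object per line (the "JSON Lines" convention), not a single pretty-printed document. The field names here are just illustrative:

```json
{"name": "alice", "age": 30}
{"name": "bob", "age": 25}
```

A typical REST response is one large, possibly nested, multi-line object, which is why the custom parsing step below is needed.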

You can download the payload like this:

import scalaj.http._

// GET the JSON payload as a plain string
// (the URL and header here are placeholders)
val response: String = Http("proto:///path/to/json")
  .header("key", "val")
  .asString
  .body

and then parse your JSON as shown in this answer. Then create a Seq of objects of your case class (say `seq`) and create a DataFrame with

seq.toDF
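Putting the pieces together, here is a minimal end-to-end sketch. The `Person` case class, the URL, and the header are hypothetical placeholders; it uses json4s for parsing (which ships as a Spark dependency) and assumes a `sqlContext` is in scope, as in a Spark 1.6 shell:

```scala
import scalaj.http._
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

// Hypothetical record shape -- adjust fields to match your service's JSON
case class Person(name: String, age: Long)

// Fetch the JSON payload from the REST endpoint (placeholder URL/header)
val body: String = Http("proto:///path/to/json")
  .header("key", "val")
  .asString
  .body

// Parse the response into case-class objects with json4s
implicit val formats: Formats = DefaultFormats
val people: Seq[Person] = parse(body).extract[Seq[Person]]

// Spark 1.6: bring the toDF implicits into scope, then build the DataFrame
import sqlContext.implicits._
val df = people.toDF()
df.show()
```

Because the parsing happens on the driver, this suits small-to-medium payloads; for very large responses you would instead distribute the raw strings as an RDD and parse on the executors.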
answered Oct 19 '22 by Ashish Awasthi