
How can I efficiently read multiple JSON files into a DataFrame or JavaRDD?

I can use the following code to read a single JSON file, but I need to read multiple JSON files and merge them into one DataFrame. How can I do this?

DataFrame jsondf = sqlContext.read().json("/home/spark/articles/article.json");

Or is there a way to read multiple JSON files into a JavaRDD and then convert it to a DataFrame?

Abu Sulaiman asked Nov 14 '15 16:11



1 Answer

To read multiple input files in Spark, use a wildcard (glob) in the path. This works whether you're constructing a DataFrame or an RDD.

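// every file in the directory whose name ends in .json
DataFrame jsondf = sqlContext.read().json("/home/spark/articles/*.json");
// or getting JSON out of S3; wildcards also work in directory names
DataFrame s3df = sqlContext.read().json("s3n://bucket/articles/201510*/*.json");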
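
If you want to see which local files a pattern like *.json would pick up before handing it to Spark, here is a minimal sketch using plain java.nio, with no Spark involved. The class name GlobPreview and the helper matchingFiles are made up for illustration. Keep in mind that Spark expands wildcards using Hadoop's glob rules, which agree with java.nio's for a simple pattern like *.json but are not identical in general, so treat this as a rough preview only.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper class, not part of Spark.
public class GlobPreview {

    // Returns the sorted names of the entries in dir whose file name matches the glob, e.g. "*.json".
    static List<String> matchingFiles(Path dir, String glob) throws IOException {
        List<String> names = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, glob)) {
            for (Path entry : stream) {
                names.add(entry.getFileName().toString());
            }
        }
        Collections.sort(names); // directory iteration order is not guaranteed
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the directory from the question; pass another one as the first argument.
        Path dir = Paths.get(args.length > 0 ? args[0] : "/home/spark/articles");
        if (!Files.isDirectory(dir)) {
            System.out.println("Not a directory: " + dir);
            return;
        }
        for (String name : matchingFiles(dir, "*.json")) {
            System.out.println(name);
        }
    }
}
```

If the list printed here is what you expect, the same directory with /*.json appended should be a reasonable path to pass to read().json(...).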
tjriggs answered Oct 19 '22 23:10
