 

How to read in-memory JSON string into Spark DataFrame

I'm trying to read an in-memory JSON string into a Spark DataFrame on the fly:

var someJSON : String = getJSONSomehow()
val someDF : DataFrame = magic.convert(someJSON)

I've spent quite a bit of time looking at the Spark API, and the best I can find is to use a sqlContext like so:

var someJSON : String = getJSONSomehow()
val tmpFilePath = s"/tmp/json/${UUID.randomUUID().toString}.json"
val tmpFile : Output = Resource.fromFile(tmpFilePath)
tmpFile.write(someJSON)(Codec.UTF8)
val someDF : DataFrame = sqlContext.read.json(tmpFilePath)

But this feels kind of awkward/wonky and imposes the following constraints:

  1. It requires me to format my JSON as one object per line (per the documentation);
  2. It forces me to write the JSON to a temp file, which is slow and awkward; and
  3. It forces me to clean up temp files over time, which is cumbersome and feels "wrong" to me.

So I ask: Is there a direct and more efficient way to convert a JSON string into a Spark DataFrame?

asked Sep 21 '16 by smeeb

People also ask

How to read and parse JSON and convert to Dataframe in spark?

Assume you have a text file containing JSON data, or a CSV file with a JSON string in one of its columns. To read these files, parse the JSON, and convert it to a DataFrame, use the from_json() function provided by Spark SQL.
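For illustration, a minimal sketch of from_json() in Scala, assuming Spark 2.1 or later; the column name, schema, and sample record below are invented for this example:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// A column holding raw JSON strings (as it might look after loading a CSV or text file)
val raw = Seq("""{"name":"Yin","city":"Columbus"}""").toDF("json")

// Schema of the embedded JSON
val schema = StructType(Seq(
  StructField("name", StringType),
  StructField("city", StringType)))

// Parse the JSON column into a struct, then flatten it into top-level columns
val parsed = raw.select(from_json($"json", schema).as("data")).select("data.*")
parsed.show()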

How to read small in memory JSON string in spark?

This approach can be used for processing a small in-memory JSON string. The following sample JSON string will be used: a simple JSON array with three items, where each item has two attributes, ID (integer) and ATTR1 (string). In Spark, the DataFrameReader object can be used to read JSON.
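A minimal sketch of that idea, assuming Spark 2.2+ (where spark.read.json accepts a Dataset[String] directly); the sample values are invented:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// The sample described above: a JSON array of three items, each with ID and ATTR1
val json = """[{"ID":1,"ATTR1":"a"},{"ID":2,"ATTR1":"b"},{"ID":3,"ATTR1":"c"}]"""

// spark.read.json accepts a Dataset[String]; a top-level array is expanded into one row per element
val df = spark.read.json(Seq(json).toDS())
df.show()
df.printSchema()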

How do I write to a JSON file from a Dataframe?

Write Spark DataFrame to JSON file: use the DataFrameWriter returned by the DataFrame's write method to write a JSON file. While writing a JSON file you can set several options. DataFrameWriter also has a mode() method to specify the SaveMode; the argument is either a mode name as a string or a constant from the SaveMode class.
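A short sketch of that writer API; the sample DataFrame and the output path below are made up:

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("ID", "ATTR1")

// Write the DataFrame as JSON, replacing any existing output at that path
df.write
  .mode(SaveMode.Overwrite)   // equivalently: .mode("overwrite")
  .json("/tmp/json-out")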

How to read multiple JSON files from different paths in spark?

Use spark.read.option("multiline", "true") to read JSON records that span multiple lines. Using the spark.read.json() method you can also read multiple JSON files from different paths: just pass all the file names, with fully qualified paths, as comma-separated arguments. You can likewise read all the files in a directory by passing the directory path.
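A sketch of those variants (the paths are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// A JSON file whose records span multiple lines
val multiLineDF = spark.read.option("multiline", "true").json("/tmp/json/multiline.json")

// Several files at once: pass each fully qualified path as a separate argument
val manyDF = spark.read.json("/tmp/json/day1.json", "/tmp/json/day2.json")

// Or everything in a directory
val dirDF = spark.read.json("/tmp/json/")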


1 Answer

From the Spark SQL guide:

val otherPeopleRDD = spark.sparkContext.makeRDD(
"""{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
val otherPeople = spark.read.json(otherPeopleRDD)
otherPeople.show()

This creates the DataFrame from an intermediate RDD built from a single JSON String, with no temp file involved.
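As a side note, on Spark 2.2+ the intermediate RDD can be skipped, since spark.read.json also accepts a Dataset[String] directly; a minimal sketch:

import spark.implicits._

// Spark 2.2+: pass a Dataset[String] instead of an RDD
val jsonDS = Seq("""{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""").toDS()
val otherPeopleDS = spark.read.json(jsonDS)
otherPeopleDS.show()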

answered Sep 29 '22 by bear911