I'm trying to read an in-memory JSON string into a Spark DataFrame on the fly:
var someJSON : String = getJSONSomehow()
val someDF : DataFrame = magic.convert(someJSON)
I've spent quite a bit of time looking at the Spark API, and the best I can find is to use a sqlContext
like so:
var someJSON : String = getJSONSomehow()
// Write the JSON string (not a literal "hello") to a temp file,
// then point the reader at that path.
val tmpFilePath = s"/tmp/json/${UUID.randomUUID().toString}"
val tmpFile : Output = Resource.fromFile(tmpFilePath)
tmpFile.write(someJSON)(Codec.UTF8)
val someDF : DataFrame = sqlContext.read.json(tmpFilePath)
But this feels awkward/wonky and imposes constraints such as writing the JSON to a temporary file on disk first.
So I ask: Is there a direct and more efficient way to convert a JSON string into a Spark DataFrame?
Assume you have a text file with JSON data, or a CSV file with a JSON string in a column. To read these files, parse the JSON, and convert it to a DataFrame, we use the from_json() function provided by Spark SQL.

1. Read and Parse JSON from a TEXT file
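A minimal sketch of that flow, assuming a local SparkSession; the temp file stands in for an existing text file, and the field names (name, city) are illustrative:

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{StructType, StructField, StringType}

val spark = SparkSession.builder().master("local[*]").appName("from-json").getOrCreate()
import spark.implicits._

// Stand-in for an existing text file: one JSON record per line.
val tmp = Files.createTempFile("records", ".txt")
Files.write(tmp, """{"name":"Yin","city":"Columbus"}""".getBytes("UTF-8"))

// Schema for the JSON payload.
val schema = StructType(Seq(
  StructField("name", StringType),
  StructField("city", StringType)))

// spark.read.text yields a single "value" column holding each raw line;
// from_json parses it into a struct, which we then flatten.
val parsed = spark.read.text(tmp.toString)
  .select(from_json($"value", schema).as("rec"))
  .select("rec.*")
parsed.show()
```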
It can also be used to process a small in-memory JSON string, for example a simple JSON array with three items, where each item has two attributes, ID and ATTR1, with integer and string data types respectively. In Spark, the DataFrameReader object can be used to read JSON.
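A sketch of that array and of reading it through a Dataset[String] (Spark 2.2+), assuming a local SparkSession; the attribute values are made up:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("json-array").getOrCreate()
import spark.implicits._

// A simple JSON array with three items; each item has an integer ID
// and a string ATTR1.
val json =
  """[{"ID":1,"ATTR1":"a"},{"ID":2,"ATTR1":"b"},{"ID":3,"ATTR1":"c"}]"""

// DataFrameReader.json accepts a Dataset[String]; a top-level JSON array
// is expanded into one row per element.
val df = spark.read.json(Seq(json).toDS)
df.show()
```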
Write a Spark DataFrame to a JSON file: use the Spark DataFrameWriter object's "write" method on the DataFrame to write a JSON file. While writing a JSON file you can use several options. Spark DataFrameWriter also has a mode() method to specify the SaveMode; the argument to this method takes either a string (e.g. "overwrite") or a constant from the SaveMode class.
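A short sketch of writing JSON with an explicit SaveMode, assuming a local SparkSession; the output path is a placeholder:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().master("local[*]").appName("write-json").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("ID", "ATTR1")

// "/tmp/json-out" is a placeholder path; Overwrite replaces any existing output.
df.write.mode(SaveMode.Overwrite).json("/tmp/json-out")
// Equivalent using the string form of the mode:
// df.write.mode("overwrite").json("/tmp/json-out")
```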
Using spark.read.option("multiline", "true") you can read a file containing a single multiline (pretty-printed) JSON document.

3. Reading Multiple Files at a Time
Using the spark.read.json() method you can also read multiple JSON files from different paths; just pass all the file names with fully qualified paths, separated by commas.

4. Reading All Files in a Directory
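The three read variants above can be sketched as follows, assuming a local SparkSession; every path is a placeholder:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("read-json").getOrCreate()

// Single multiline (pretty-printed) JSON file.
val multiDF = spark.read.option("multiline", "true").json("/tmp/json/multiline.json")

// Multiple files at once: json() is variadic over paths.
val manyDF = spark.read.json("/tmp/json/a.json", "/tmp/json/b.json")

// All files in a directory.
val dirDF = spark.read.json("/tmp/json/")
```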
From Spark SQL guide:
val otherPeopleRDD = spark.sparkContext.makeRDD(
"""{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
val otherPeople = spark.read.json(otherPeopleRDD)
otherPeople.show()
This creates a DataFrame from an intermediate RDD (created by passing a String).
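On Spark 2.2+ the intermediate RDD is unnecessary: DataFrameReader.json also accepts a Dataset[String], so the original question can be answered in one line. A sketch, assuming a local SparkSession:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("json-from-string").getOrCreate()
import spark.implicits._

val someJSON = """{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}"""
// json() takes a Dataset[String] directly -- no RDD and no temp file.
val someDF = spark.read.json(Seq(someJSON).toDS)
someDF.show()
```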