I have a file with a bunch of columns, and one column called jsonstring is of string type and contains JSON strings. Let's say the format is the following:
{
  "key1": "value1",
  "key2": {
    "level2key1": "level2value1",
    "level2key2": "level2value2"
  }
}
I want to parse this column with something like jsonstring.key1 and jsonstring.key2.level2key1 to return value1 and level2value1.
How can I do that in Scala or Spark SQL?
With Spark 2.2 you can use the function from_json, which does the JSON parsing for you.

from_json(e: Column, schema: String, options: Map[String, String]): Column

parses a column containing a JSON string into a StructType or an ArrayType of StructTypes with the specified schema. Combined with the support for flattening nested columns using * (star), that seems the best solution.
// the input dataset (just a single JSON blob)
import spark.implicits._  // for toDF and the $ column syntax
val jsonstrings = Seq("""{
  "key1": "value1",
  "key2": {
    "level2key1": "level2value1",
    "level2key2": "level2value2"
  }
}""").toDF("jsonstring")
// define the schema of JSON messages
import org.apache.spark.sql.types._
val key2schema = new StructType()
  .add($"level2key1".string)
  .add($"level2key2".string)
val schema = new StructType()
  .add($"key1".string)
  .add("key2", key2schema)
scala> schema.printTreeString
root
|-- key1: string (nullable = true)
|-- key2: struct (nullable = true)
| |-- level2key1: string (nullable = true)
| |-- level2key2: string (nullable = true)
import org.apache.spark.sql.functions.from_json
val messages = jsonstrings
  .select(from_json($"jsonstring", schema) as "json")
  .select("json.*") // <-- flattening nested fields
scala> messages.show(truncate = false)
+------+---------------------------+
|key1 |key2 |
+------+---------------------------+
|value1|[level2value1,level2value2]|
+------+---------------------------+
scala> messages.select("key1", "key2.*").show(truncate = false)
+------+------------+------------+
|key1 |level2key1 |level2key2 |
+------+------------+------------+
|value1|level2value1|level2value2|
+------+------------+------------+
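If you would rather stay in Spark SQL (the question mentions it as an option), from_json is also available as a SQL function. A minimal sketch, assuming Spark 2.3+ for the DDL-style schema string; the temp-view name "events" is just made up here:

// register the input DataFrame as a temp view so it can be queried with SQL
jsonstrings.createOrReplaceTempView("events")

spark.sql("""
  SELECT json.key1, json.key2.level2key1
  FROM (
    SELECT from_json(jsonstring,
                     'key1 STRING, key2 STRUCT<level2key1: STRING, level2key2: STRING>') AS json
    FROM events
  ) AS parsed
""").show(truncate = false)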
You can use withColumn + udf + json4s:
import org.json4s.{DefaultFormats, MappingException}
import org.json4s.jackson.JsonMethods._
import org.apache.spark.sql.functions._
def getJsonContent(jsonstring: String): (String, String) = {
  implicit val formats = DefaultFormats
  val parsedJson = parse(jsonstring)
  val value1 = (parsedJson \ "key1").extract[String]
  val level2value1 = (parsedJson \ "key2" \ "level2key1").extract[String]
  (value1, level2value1)
}
val getJsonContentUDF = udf((jsonstring: String) => getJsonContent(jsonstring))
df.withColumn("parsedJson", getJsonContentUDF(df("jsonstring")))