 

How to extract values from json string?

I have a file with a bunch of columns, one of which, called jsonstring, is a string column containing JSON. Let's say the format is the following:

{
    "key1": "value1",
    "key2": {
        "level2key1": "level2value1",
        "level2key2": "level2value2"
    }
}

I want to query this column with paths like jsonstring.key1 and jsonstring.key2.level2key1, and get back value1 and level2value1.

How can I do that in Scala or Spark SQL?

asked Aug 30 '16 by Manish Shukla



2 Answers

With Spark 2.2 you can use the from_json function, which does the JSON parsing for you.

from_json(e: Column, schema: String, options: Map[String, String]): Column parses a column containing a JSON string into a StructType or ArrayType of StructTypes with the specified schema.

Combined with support for flattening nested columns using * (star), this seems like the best solution.
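For reference, the signature quoted above takes the schema as a DDL-formatted string rather than a StructType, so the same parsing can be sketched more compactly (assuming Spark 2.2+ and the jsonstring column from the question):

```scala
// the String-schema variant of from_json: schema as a DDL string, no extra options
val parsed = jsonstrings.select(
  from_json($"jsonstring",
    "key1 STRING, key2 STRUCT<level2key1: STRING, level2key2: STRING>",
    Map.empty[String, String]) as "json")
parsed.select("json.key1", "json.key2.level2key1").show(truncate = false)
```

This avoids building the StructType by hand, at the cost of losing compile-time checking of the schema definition.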

// the input dataset (just a single JSON blob)
import spark.implicits._ // for toDF and the $-notation
val jsonstrings = Seq("""{
    "key1": "value1",
    "key2": {
        "level2key1": "level2value1",
        "level2key2": "level2value2"
    }
}""").toDF("jsonstring")

// define the schema of the JSON messages
import org.apache.spark.sql.types._
val key2schema = new StructType()
  .add("level2key1", StringType)
  .add("level2key2", StringType)
val schema = new StructType()
  .add("key1", StringType)
  .add("key2", key2schema)
scala> schema.printTreeString
root
 |-- key1: string (nullable = true)
 |-- key2: struct (nullable = true)
 |    |-- level2key1: string (nullable = true)
 |    |-- level2key2: string (nullable = true)

import org.apache.spark.sql.functions.from_json
val messages = jsonstrings
  .select(from_json($"jsonstring", schema) as "json")
  .select("json.*") // <-- flatten the nested fields
scala> messages.show(truncate = false)
+------+---------------------------+
|key1  |key2                       |
+------+---------------------------+
|value1|[level2value1,level2value2]|
+------+---------------------------+

scala> messages.select("key1", "key2.*").show(truncate = false)
+------+------------+------------+
|key1  |level2key1  |level2key2  |
+------+------------+------------+
|value1|level2value1|level2value2|
+------+------------+------------+
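Since the question also asks about Spark SQL: without defining a schema at all, individual paths can be pulled out with get_json_object, which takes a JSONPath-style expression. A minimal sketch against the same jsonstrings dataset:

```scala
import org.apache.spark.sql.functions.get_json_object

jsonstrings.select(
  get_json_object($"jsonstring", "$.key1").as("key1"),
  get_json_object($"jsonstring", "$.key2.level2key1").as("level2key1")
).show(truncate = false)

// or, after jsonstrings.createOrReplaceTempView("t"), in plain SQL:
//   SELECT get_json_object(jsonstring, '$.key1'),
//          get_json_object(jsonstring, '$.key2.level2key1')
//   FROM t
```

This is convenient for grabbing one or two fields, but each call re-parses the JSON string, so from_json with a schema is preferable when extracting many fields.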
answered Sep 23 '22 by Jacek Laskowski

You can use withColumn + udf + json4s:

import org.json4s.{DefaultFormats, MappingException}
import org.json4s.jackson.JsonMethods._
import org.apache.spark.sql.functions._

def getJsonContent(jsonstring: String): (String, String) = {
    implicit val formats = DefaultFormats
    val parsedJson = parse(jsonstring)  
    val value1 = (parsedJson \ "key1").extract[String]
    val level2value1 = (parsedJson \ "key2" \ "level2key1").extract[String]
    (value1, level2value1)
}
val getJsonContentUDF = udf((jsonstring: String) => getJsonContent(jsonstring))

df.withColumn("parsedJson", getJsonContentUDF(df("jsonstring")))
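The UDF returns a Tuple2, which Spark represents as a two-field struct, so the individual values can then be pulled out as ordinary columns. A sketch, assuming df has the jsonstring column from the question:

```scala
val parsed = df.withColumn("parsedJson", getJsonContentUDF(df("jsonstring")))
parsed.select(
  $"parsedJson._1".as("key1"),       // first tuple element -> value1
  $"parsedJson._2".as("level2key1")  // second tuple element -> level2value1
).show(truncate = false)
```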
answered Sep 21 '22 by linbojin