I have a file with a bunch of columns, and one column called jsonstring is of string type and contains JSON strings. Let's say the format is the following:
{
  "key1": "value1",
  "key2": {
    "level2key1": "level2value1",
    "level2key2": "level2value2"
  }
}
I want to parse this column with something like jsonstring.key1 and jsonstring.key2.level2key1 to return value1 and level2value1.
How can I do that in Scala or Spark SQL?
With Spark 2.2 you can use the function from_json, which does the JSON parsing for you.

from_json(e: Column, schema: String, options: Map[String, String]): Column

parses a column containing a JSON string into a StructType or an ArrayType of StructTypes with the specified schema. Combined with the support for flattening nested columns using * (star), that seems the best solution.
// the input dataset (just a single JSON blob)
import spark.implicits._  // for toDF and the $ column syntax
val jsonstrings = Seq("""{
  "key1": "value1",
  "key2": {
    "level2key1": "level2value1",
    "level2key2": "level2value2"
  }
}""").toDF("jsonstring")
// define the schema of JSON messages
import org.apache.spark.sql.types._
val key2schema = new StructType()
  .add($"level2key1".string)
  .add($"level2key2".string)
val schema = new StructType()
  .add($"key1".string)
  .add("key2", key2schema)
scala> schema.printTreeString
root
|-- key1: string (nullable = true)
|-- key2: struct (nullable = true)
| |-- level2key1: string (nullable = true)
| |-- level2key2: string (nullable = true)
import org.apache.spark.sql.functions.from_json
val messages = jsonstrings
  .select(from_json($"jsonstring", schema) as "json")
  .select("json.*") // <-- flattening nested fields
scala> messages.show(truncate = false)
+------+---------------------------+
|key1 |key2 |
+------+---------------------------+
|value1|[level2value1,level2value2]|
+------+---------------------------+
scala> messages.select("key1", "key2.*").show(truncate = false)
+------+------------+------------+
|key1 |level2key1 |level2key2 |
+------+------------+------------+
|value1|level2value1|level2value2|
+------+------------+------------+
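If you would rather stay in Spark SQL (the question mentions it as an option), from_json is also available as a SQL function. A minimal sketch, assuming Spark 2.3+ for the DDL-style schema string; the temp-view name "events" is just made up here:

// register the input DataFrame as a temp view so it can be queried with SQL
jsonstrings.createOrReplaceTempView("events")

spark.sql("""
  SELECT json.key1, json.key2.level2key1
  FROM (
    SELECT from_json(jsonstring,
                     'key1 STRING, key2 STRUCT<level2key1: STRING, level2key2: STRING>') AS json
    FROM events
  ) AS parsed
""").show(truncate = false)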
You can use withColumn + udf + json4s:
import org.json4s.{DefaultFormats, MappingException}
import org.json4s.jackson.JsonMethods._
import org.apache.spark.sql.functions._
def getJsonContent(jsonstring: String): (String, String) = {
  implicit val formats = DefaultFormats
  val parsedJson = parse(jsonstring)
  val value1 = (parsedJson \ "key1").extract[String]
  val level2value1 = (parsedJson \ "key2" \ "level2key1").extract[String]
  (value1, level2value1)
}
val getJsonContentUDF = udf((jsonstring: String) => getJsonContent(jsonstring))
df.withColumn("parsedJson", getJsonContentUDF(df("jsonstring")))