how to read json with schema in spark dataframes/spark sql

Tags:

sql/dataframes, please help me out or provide some good suggestion on how to read this json

{
    "billdate":"2016-08-08',
    "accountid":"xxx"
    "accountdetails":{
        "total":"1.1"
        "category":[
        {
            "desc":"one",
            "currentinfo":{
            "value":"10"
        },
            "subcategory":[
            {
                "categoryDesc":"sub",
                "value":"10",
                "currentinfo":{
                    "value":"10"
                }
            }]
        }]
    }
}

Thanks,

818

asked Sep 06 '16 17:09

raj kumar

1 Answers

Seems like your json is not valid. pls check with http://www.jsoneditoronline.org/

Please see an-introduction-to-json-support-in-spark-sql.html

if you want to register as the table you can register like below and print the schema.

DataFrame df = sqlContext.read().json("/path/to/validjsonfile").toDF();
    df.registerTempTable("df");
    df.printSchema();

Below is sample code snippet

DataFrame app = df.select("toplevel");
        app.registerTempTable("toplevel");
        app.printSchema();
        app.show();
DataFrame appName = app.select("toplevel.sublevel");
        appName.registerTempTable("sublevel");
        appName.printSchema();
        appName.show();

Example with scala :

{"name":"Michael", "cities":["palo alto", "menlo park"], "schools":[{"sname":"stanford", "year":2010}, {"sname":"berkeley", "year":2012}]}
{"name":"Andy", "cities":["santa cruz"], "schools":[{"sname":"ucsb", "year":2011}]}
{"name":"Justin", "cities":["portland"], "schools":[{"sname":"berkeley", "year":2014}]}

 val people = sqlContext.read.json("people.json")
people: org.apache.spark.sql.DataFrame

Reading top level field

val names = people.select('name).collect()
names: Array[org.apache.spark.sql.Row] = Array([Michael], [Andy], [Justin])

 names.map(row => row.getString(0))
res88: Array[String] = Array(Michael, Andy, Justin)

Use the select() method to specify the top-level field, collect() to collect it into an Array[Row], and the getString() method to access a column inside each Row.

Flatten and Read a JSON Array

each Person has an array of "cities". Let's flatten these arrays and read out all their elements.

val flattened = people.explode("cities", "city"){c: List[String] => c}
flattened: org.apache.spark.sql.DataFrame

val allCities = flattened.select('city).collect()
allCities: Array[org.apache.spark.sql.Row]

 allCities.map(row => row.getString(0))
res92: Array[String] = Array(palo alto, menlo park, santa cruz, portland)

The explode() method explodes, or flattens, the cities array into a new column named "city". We then use select() to select the new column, collect() to collect it into an Array[Row], and getString() to access the data inside each Row.

Read an Array of Nested JSON Objects, Unflattened

read out the "schools" data, which is an array of nested JSON objects. Each element of the array holds the school name and year:

 val schools = people.select('schools).collect()
schools: Array[org.apache.spark.sql.Row]


val schoolsArr = schools.map(row => row.getSeq[org.apache.spark.sql.Row](0))
schoolsArr: Array[Seq[org.apache.spark.sql.Row]]

 schoolsArr.foreach(schools => {
    schools.map(row => print(row.getString(0), row.getLong(1)))
    print("\n")
 })
(stanford,2010)(berkeley,2012) 
(ucsb,2011) 
(berkeley,2014)

Use select() and collect() to select the "schools" array and collect it into an Array[Row]. Now, each "schools" array is of type List[Row], so we read it out with the getSeq[Row]() method. Finally, we can read the information for each individual school, by calling getString() for the school name and getLong() for the school year.

179

answered Oct 15 '22 19:10

Ram Ghadiyaram

Related questions
                            
                                Why is the main function not running in the REPL?
                            
                                Parsing very large xml lazily
                            
                                Parameterized logging in slf4j - how does it compare to scala's by-name parameters?
                            
                                Scala: split string via pattern matching
                            
                                How do I sbt test continually until a flakey test fails?
                            
                                Listening to Akka-HTTP ports remotely
                            
                                Spark Select with a List of Columns Scala
                            
                                Scala insert into list at specific locations
                            
                                If-statement scoped variables
                            
                                Scala best practices: simple Option[] usage
                            
                                How to create a synchronized object method in scala
                            
                                Removing punctuation marks form text in Scala - Spark
                            
                                Spark: Joining with array
                            
                                Is there an equivalent in Scala to Python's more general map function?
                            
                                Select certain super class for method call in Scala
                            
                                Setting HTTP headers in Play 2.0 (scala)?
                            
                                Where did Option[T] come from in Scala?
                            
                                How do I remove trailing elements in a Scala collection?
                            
                                In these cases, the Scala value class will be "boxed", right?
                            
                                How to generate TimeUUID in Java/Scala

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

how to read json with schema in spark dataframes/spark sql

Tags:

dataframe

scala

apache-spark

apache-spark-sql