JSON file parsing in PySpark

I am very new to PySpark. I tried parsing a JSON file with the following code:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.json("file:///home/malwarehunter/Downloads/122116-path.json")
df.printSchema()

The output is as follows.

root
 |-- _corrupt_record: string (nullable = true)

df.show()

The output looks like this:

+--------------------+
|     _corrupt_record|
+--------------------+
|                   {|
|  "time1":"2...|
|  "time2":"201...|
|    "step":0.5,|
|          "xyz":[|
|                   {|
|      "student":"00010...|
|      "attr...|
|        [ -2.52, ...|
|        [ -2.3, -...|
|        [ -1.97, ...|
|        [ -1.27, ...|
|        [ -1.03, ...|
|        [ -0.8, -...|
|        [ -0.13, ...|
|        [ 0.09, -...|
|        [ 0.54, -...|
|        [  1.1, -...|
|        [ 1.34, 0...|
|        [ 1.64, 0...|
+--------------------+
only showing top 20 rows

The JSON file looks like this:

{
  "time1": "2016-12-16T00:00:00.000",
  "time2": "2016-12-16T23:59:59.000",
  "step": 0.5,
  "xyz": [
    {
      "student": "0001025D0007F5DB",
      "attr": [
        [ -2.52, -1.17 ],
        [ -2.3, -1.15 ],
        [ -1.97, -1.19 ],
        [ 10.16, 4.08 ],
        [ 10.23, 4.87 ],
        [ 9.96, 5.09 ]
      ]
    },
    {
      "student": "0001025D0007F5DC",
      "attr": [
        [ -2.58, -0.99 ],
        [ 10.12, 3.89 ],
        [ 10.27, 4.59 ],
        [ 10.05, 5.02 ]
      ]
    }
  ]
}

Could you help me parse this and create a DataFrame like the one below?

Required output DataFrame: [image in original post]

asked Jan 09 '17 by Jil Jung Juk

People also ask

How do I write a JSON file in PySpark?

A PySpark DataFrame is written out as a JSON file with the dataframe.write.mode(...).json(...) chain.
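For example, a minimal sketch (the output path here is just an illustration):

# writes one JSON object per line, one part-file per partition
df.write.mode("overwrite").json("file:///tmp/students-json")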

How do I read a JSON file in spark?

Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame using the read.json() function, which loads data from a directory of JSON files where each line is a separate JSON object. Note that a file offered as JSON input is expected to be line-delimited JSON, not a typical pretty-printed JSON file.

How do I read multiple JSON files in spark?

In PySpark, if all the JSON files are in the same folder, you can use df = spark.read.json('folder_path'). This loads every JSON file inside the folder.
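read.json also accepts a list of paths, so a sketch like the following (folder names are illustrative) reads several locations into one DataFrame at once:

# each folder's JSON files are combined into a single DataFrame
df = spark.read.json(["folder_a", "folder_b"])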


2 Answers

Spark >= 2.2:

You can use the multiLine argument of the JSON reader:

spark.read.json(path_to_input, multiLine=True)
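The required output is only shown as an image above, but assuming one row per student per [x, y] pair is wanted, a sketch along these lines (the column names x and y are my own) would flatten the nested xyz and attr arrays after a multiLine read:

from pyspark.sql.functions import col, explode

df = spark.read.json("file:///home/malwarehunter/Downloads/122116-path.json",
                     multiLine=True)

# first one row per student, then one row per [x, y] pair
flat = (df
    .select("time1", "time2", "step", explode("xyz").alias("xyz"))
    .select("time1", "time2", "step",
            col("xyz.student").alias("student"),
            explode("xyz.attr").alias("attr"))
    .select("time1", "time2", "step", "student",
            col("attr").getItem(0).alias("x"),
            col("attr").getItem(1).alias("y")))
flat.show()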

Spark < 2.2:

There is an almost universal, but rather expensive, solution that can be used to read multiline JSON files:

  • Read the data using SparkContext.wholeTextFiles.
  • Drop the keys (file names).
  • Pass the result to DataFrameReader.json.

As long as there are no other problems with your data, it should do the trick:

spark.read.json(sc.wholeTextFiles(path_to_input).values())
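Spelled out with the steps above as comments (and reusing the question's path), the same thing looks like this; on Spark 1.x, sqlContext.read.json plays the same role:

# 1. read whole files as (filename, content) pairs
rdd = sc.wholeTextFiles("file:///home/malwarehunter/Downloads/122116-path.json")
# 2. drop the file-name keys, keeping each file's full text as one JSON string
json_strings = rdd.values()
# 3. let the DataFrame reader parse the RDD of JSON strings
df = spark.read.json(json_strings)
df.printSchema()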
answered Oct 14 '22 by zero323


I experienced a similar issue. When Spark reads a JSON file, it expects each line to be a separate JSON object, so it fails if you try to load a pretty-printed JSON file. My workaround was to minify the JSON file before Spark read it.
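A minimal sketch of that workaround, using Python's standard json module (the file names and helper name are my own):

import json

def minify_json(src_path, dst_path):
    # re-serialize a pretty-printed JSON file onto a single line
    with open(src_path) as src:
        data = json.load(src)
    with open(dst_path, "w") as dst:
        json.dump(data, dst, separators=(",", ":"))

minify_json("122116-path.json", "122116-path.min.json")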

answered Oct 14 '22 by Elsis