JSON file parsing in PySpark

I am very new to PySpark. I tried parsing a JSON file with the following code:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.json("file:///home/malwarehunter/Downloads/122116-path.json")
df.printSchema()

The output is as follows.

root
 |-- _corrupt_record: string (nullable = true)

df.show()

The output looks like this:

+--------------------+
|     _corrupt_record|
+--------------------+
|                   {|
|  "time1":"2...|
|  "time2":"201...|
|    "step":0.5,|
|          "xyz":[|
|                   {|
|      "student":"00010...|
|      "attr...|
|        [ -2.52, ...|
|        [ -2.3, -...|
|        [ -1.97, ...|
|        [ -1.27, ...|
|        [ -1.03, ...|
|        [ -0.8, -...|
|        [ -0.13, ...|
|        [ 0.09, -...|
|        [ 0.54, -...|
|        [  1.1, -...|
|        [ 1.34, 0...|
|        [ 1.64, 0...|
+--------------------+
only showing top 20 rows

The JSON file looks like this:

{
  "time1": "2016-12-16T00:00:00.000",
  "time2": "2016-12-16T23:59:59.000",
  "step": 0.5,
  "xyz": [
    {
      "student": "0001025D0007F5DB",
      "attr": [
        [ -2.52, -1.17 ],
        [ -2.3, -1.15 ],
        [ -1.97, -1.19 ],
        [ 10.16, 4.08 ],
        [ 10.23, 4.87 ],
        [ 9.96, 5.09 ]
      ]
    },
    {
      "student": "0001025D0007F5DC",
      "attr": [
        [ -2.58, -0.99 ],
        [ 10.12, 3.89 ],
        [ 10.27, 4.59 ],
        [ 10.05, 5.02 ]
      ]
    }
  ]
}

Could you help me parse this and create a DataFrame like the one below?

Required output DataFrame: [image in original post]

asked Jan 09 '17 by Jil Jung Juk

People also ask

How do I write a JSON file in PySpark?

A PySpark DataFrame is written out as a JSON file with the dataframe.write.mode(...).json(...) chain.
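For example, a minimal sketch (the output path here is just an illustration):

# writes one JSON object per line, one part-file per partition
df.write.mode("overwrite").json("file:///tmp/students-json")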

How do I read a JSON file in spark?

Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame using the read.json() function, which loads data from a directory of JSON files where each line is a separate JSON object. Note that a file offered as JSON input is expected to be line-delimited JSON, not a typical pretty-printed JSON file.

How do I read multiple JSON files in spark?

In PySpark, if all the JSON files are in the same folder, you can use df = spark.read.json('folder_path'). This loads every JSON file inside the folder.
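read.json also accepts a list of paths, so a sketch like the following (folder names are illustrative) reads several locations into one DataFrame at once:

# each folder's JSON files are combined into a single DataFrame
df = spark.read.json(["folder_a", "folder_b"])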


2 Answers

Spark >= 2.2:

You can use the multiLine argument of the JSON reader:

spark.read.json(path_to_input, multiLine=True)
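The required output is only shown as an image above, but assuming one row per student per [x, y] pair is wanted, a sketch along these lines (the column names x and y are my own) would flatten the nested xyz and attr arrays after a multiLine read:

from pyspark.sql.functions import col, explode

df = spark.read.json("file:///home/malwarehunter/Downloads/122116-path.json",
                     multiLine=True)

# first one row per student, then one row per [x, y] pair
flat = (df
    .select("time1", "time2", "step", explode("xyz").alias("xyz"))
    .select("time1", "time2", "step",
            col("xyz.student").alias("student"),
            explode("xyz.attr").alias("attr"))
    .select("time1", "time2", "step", "student",
            col("attr").getItem(0).alias("x"),
            col("attr").getItem(1).alias("y")))
flat.show()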

Spark < 2.2:

There is an almost universal, but rather expensive, solution that can be used to read multiline JSON files:

  • Read the data using SparkContext.wholeTextFiles.
  • Drop the keys (file names).
  • Pass the result to DataFrameReader.json.

As long as there are no other problems with your data, it should do the trick:

spark.read.json(sc.wholeTextFiles(path_to_input).values())
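Spelled out with the steps above as comments (and reusing the question's path), the same thing looks like this; on Spark 1.x, sqlContext.read.json plays the same role:

# 1. read whole files as (filename, content) pairs
rdd = sc.wholeTextFiles("file:///home/malwarehunter/Downloads/122116-path.json")
# 2. drop the file-name keys, keeping each file's full text as one JSON string
json_strings = rdd.values()
# 3. let the DataFrame reader parse the RDD of JSON strings
df = spark.read.json(json_strings)
df.printSchema()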
answered Oct 14 '22 by zero323


I experienced a similar issue. When Spark reads a JSON file, it expects each line to be a separate JSON object, so it fails if you try to load a pretty-printed JSON file. My workaround was to minify the JSON file before Spark read it.
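A minimal sketch of that workaround, using Python's standard json module (the file names and helper name are my own):

import json

def minify_json(src_path, dst_path):
    # re-serialize a pretty-printed JSON file onto a single line
    with open(src_path) as src:
        data = json.load(src)
    with open(dst_path, "w") as dst:
        json.dump(data, dst, separators=(",", ":"))

minify_json("122116-path.json", "122116-path.min.json")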

answered Oct 14 '22 by Elsis