I am very new to PySpark. I tried parsing the JSON file using the following code:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.json("file:///home/malwarehunter/Downloads/122116-path.json")
df.printSchema()
The output is as follows.
root
 |-- _corrupt_record: string (nullable = true)
df.show()
The output looks like this:
+--------------------+
| _corrupt_record|
+--------------------+
| {|
| "time1":"2...|
| "time2":"201...|
| "step":0.5,|
| "xyz":[|
| {|
| "student":"00010...|
| "attr...|
| [ -2.52, ...|
| [ -2.3, -...|
| [ -1.97, ...|
| [ -1.27, ...|
| [ -1.03, ...|
| [ -0.8, -...|
| [ -0.13, ...|
| [ 0.09, -...|
| [ 0.54, -...|
| [ 1.1, -...|
| [ 1.34, 0...|
| [ 1.64, 0...|
+--------------------+
only showing top 20 rows
The JSON file looks like this:
{
  "time1": "2016-12-16T00:00:00.000",
  "time2": "2016-12-16T23:59:59.000",
  "step": 0.5,
  "xyz": [
    {
      "student": "0001025D0007F5DB",
      "attr": [
        [ -2.52, -1.17 ],
        [ -2.3, -1.15 ],
        [ -1.97, -1.19 ],
        [ 10.16, 4.08 ],
        [ 10.23, 4.87 ],
        [ 9.96, 5.09 ]
      ]
    },
    {
      "student": "0001025D0007F5DC",
      "attr": [
        [ -2.58, -0.99 ],
        [ 10.12, 3.89 ],
        [ 10.27, 4.59 ],
        [ 10.05, 5.02 ]
      ]
    }
  ]
}
Could you help me parse this and create a DataFrame from it?
Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame using the read.json() function, which loads data from a directory of JSON files where each line of the files is a separate JSON object. Note that a file offered as a JSON file in this sense is not a typical pretty-printed JSON file.
Using PySpark, if you have all the JSON files in the same folder, you can use df = spark.read.json('folder_path'). This instruction will load all the JSON files inside the folder.
Finally, the PySpark DataFrame can be written back to a JSON file using the dataframe.write.mode().json() function.
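For example, a minimal sketch of that round trip (the folder paths here are placeholders, not from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Load every line-delimited JSON file in the input folder into one DataFrame
df = spark.read.json("/path/to/json_folder")

# Write the DataFrame back out as JSON, overwriting any existing output
df.write.mode("overwrite").json("/path/to/output_folder")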
Spark >= 2.2:
You can use the multiLine argument of the JSON reader:
spark.read.json(path_to_input, multiLine=True)
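For example, a minimal sketch using the file from the question (it assumes you then want one row per student; the student and attr fields come from the JSON itself):

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.getOrCreate()

# Read the whole pretty-printed file as a single JSON document
df = spark.read.json("file:///home/malwarehunter/Downloads/122116-path.json",
                     multiLine=True)
df.printSchema()

# Flatten the xyz array into one row per student
flat = (df
        .select("time1", "time2", "step", explode("xyz").alias("xyz"))
        .select("time1", "time2", "step", "xyz.student", "xyz.attr"))
flat.show()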
Spark < 2.2:
There is an almost universal, but rather expensive, solution that can be used to read multiline JSON files: read the files with SparkContext.wholeTextFiles and then parse the contents with DataFrameReader.json. As long as there are no other problems with your data, it should do the trick:
spark.read.json(sc.wholeTextFiles(path_to_input).values())
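Note that wholeTextFiles returns an RDD of (path, content) pairs, which is why .values() is used to keep only the file contents. A minimal sketch with the path from the question, assuming sc is an existing SparkContext:

# Each file becomes a single (path, content) record, so pretty-printing is harmless
raw = sc.wholeTextFiles("file:///home/malwarehunter/Downloads/122116-path.json")
df = spark.read.json(raw.values())
df.printSchema()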
I experienced a similar issue. When Spark reads a JSON file, it expects each line to be a separate JSON object, so it will fail if you try to load a pretty-printed JSON file. My workaround was to minify the JSON file before Spark read it.
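For example, a minimal sketch of that workaround (the paths are placeholders, and it assumes the file fits in memory, since the minification happens outside Spark):

import json

# Re-serialize the pretty-printed file onto a single line
with open("/path/to/pretty.json") as src:
    data = json.load(src)
with open("/path/to/minified.json", "w") as dst:
    json.dump(data, dst)

df = spark.read.json("file:///path/to/minified.json")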