
Spark read.json does not consider booleans in python

I have been trying to follow an example for converting a JSON string to a DataFrame in Spark, following the official documentation here.

The following case works fine:

jsonStrings = ['{"name":"Yin","address":{"city":"Columbus","state":true}}']
otherPeopleRDD = sc.parallelize(jsonStrings)
otherPeople = spark.read.json(otherPeopleRDD)
otherPeople.show()

====== OUTPUT =======
+----------------+----+
|         address|name|
+----------------+----+
|[Columbus, true]| Yin|
+----------------+----+

root
 |-- address: struct (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- state: boolean (nullable = true)
 |-- name: string (nullable = true)

But I get an error when I pass the boolean value as True (the Python literal):

jsonStrings = ['{"name":"Yin","address":{"city":"Columbus","state":True}}']
otherPeopleRDD = sc.parallelize(jsonStrings)
otherPeople = spark.read.json(otherPeopleRDD)
otherPeople.show()

====== OUTPUT =======
+--------------------+
|     _corrupt_record|
+--------------------+
|{"name":"Yin","ad...|
+--------------------+

root
 |-- _corrupt_record: string (nullable = true)

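Python's own json module rejects the same string, which confirms the record is simply not valid JSON (a minimal check, runnable without Spark):

```python
import json

bad = '{"name":"Yin","address":{"city":"Columbus","state":True}}'

try:
    json.loads(bad)
except json.JSONDecodeError as e:
    # The parser stops at the capital-T "True": that is a Python
    # literal, not a JSON one -- JSON only allows lowercase true/false.
    print(f"invalid JSON: {e}")
```

Spark puts anything it cannot parse into `_corrupt_record`, which matches the output above.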
To give some context: I am calling a REST API to get JSON data using the requests library in Python. I then get the JSON by calling .json() on the response. This gives me data where boolean values are capitalized as in Python (true becomes True, false becomes False). I think this is the desired behavior, but when I pass this to Spark, it complains about the format of the JSON string as shown above.

resp = requests.get(url, params=query_str, cookies=cookie_str)
jsonString = resp.json()
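A sketch of what is going on, using a stand-in dict instead of a live request (the url and query_str above are the question's own placeholders): .json() returns a parsed Python object, not a string, so the "capitalized booleans" are just Python's repr of that object. Serializing it back with json.dumps produces proper JSON.

```python
import json

# Stand-in for resp.json(): requests parses the response body into a
# Python dict, so booleans are already Python's True/False objects.
payload = {"name": "Yin", "address": {"city": "Columbus", "state": True}}

# str(payload) shows Python literals (True), which is NOT valid JSON...
print(str(payload))

# ...while json.dumps serializes back to proper JSON (lowercase true).
print(json.dumps(payload))
```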

I have read through the documentation and searched the web, but didn't find anything that might help. Can someone please help me out here?

UPDATE: I found one possible explanation. This may be because of the JSON encoding and decoding offered by the json library in Python. Link But that still doesn't explain why pyspark is not able to handle Python-style booleans.
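The encode/decode mapping hinted at in the update is symmetric in the json module: decoding turns JSON's lowercase true into Python's True, and encoding turns it back (a small sketch):

```python
import json

# Decoding: JSON's lowercase true -> Python's True
decoded = json.loads('{"state": true}')
print(decoded)  # {'state': True}

# Encoding: Python's True -> JSON's lowercase true
encoded = json.dumps({"state": True})
print(encoded)  # {"state": true}
```

Spark only ever speaks the right-hand side of this mapping; it has no notion of Python literals.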

asked Apr 23 '26 20:04 by harshlal028

1 Answer

Use the json module to re-serialize the response into a valid JSON string, and wrap that string in a list before parallelizing it (otherwise sc.parallelize would split the string into individual characters):

import json

# resp.json() is a Python dict; json.dumps turns it back into
# valid JSON (True -> true, False -> false).
spark_friendly_json = json.dumps(resp.json())

# Wrap in a list so the RDD holds one record, not one char per element.
otherPeopleRDD = sc.parallelize([spark_friendly_json])
otherPeople = spark.read.json(otherPeopleRDD)
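The list wrapper matters because sc.parallelize treats its argument as a collection, and a bare string is a collection of characters. The effect can be seen without Spark:

```python
json_str = '{"state": true}'

# A bare string iterates character by character -- each char would
# become its own RDD record, none of which is valid JSON.
print(list(json_str)[:3])  # ['{', '"', 's']

# Wrapping in a list yields a single, intact record.
print([json_str])  # ['{"state": true}']
```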
answered Apr 26 '26 09:04 by Tanjin


