I've got this JSON file
{ "a": 1, "b": 2 }
which was written with Python's json.dump method. Now I want to read this file into a DataFrame in Spark, using PySpark. Following the documentation, I'm doing this:
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlc = SQLContext(sc)
df = sqlc.read.json('my_file.json')
print df.show()
The print statement spits out this though:
+---------------+
|_corrupt_record|
+---------------+
|              {|
|        "a": 1,|
|         "b": 2|
|              }|
+---------------+
Does anyone know what's going on and why the file isn't being interpreted correctly?
If you want to leave your JSON file as it is (without stripping the newline characters \n), include the multiLine=True keyword argument:

sc = SparkContext()
sqlc = SQLContext(sc)
df = sqlc.read.json('my_file.json', multiLine=True)
df.show()
By default you need one JSON object per line in your input file (the "JSON Lines" format); see http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.json
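If you'd rather fix the file than pass multiLine=True, you can rewrite it in JSON Lines form with the standard json module. A minimal sketch (the file names and the sample object recreating the question's file are hypothetical):

```python
import json

# Recreate the kind of file the question describes: a single object
# written with json.dump, possibly spanning several lines.
with open('my_file.json', 'w') as f:
    json.dump({"a": 1, "b": 2}, f, indent=2)

# Load the whole file back as one Python object.
with open('my_file.json') as f:
    data = json.load(f)

# Spark's default reader expects one compact JSON object per line,
# so wrap a single object in a list and write each object on its own line.
records = data if isinstance(data, list) else [data]

with open('my_file.jsonl', 'w') as f:
    for record in records:
        f.write(json.dumps(record) + '\n')
```

The resulting my_file.jsonl can then be read with sqlc.read.json without any extra arguments.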
If your JSON file looks like this, it will give you the expected DataFrame:
{ "a": 1, "b": 2 }
{ "a": 3, "b": 4 }
....

df.show()
+---+---+
|  a|  b|
+---+---+
|  1|  2|
|  3|  4|
+---+---+