I've got this JSON file
{ "a": 1, "b": 2 }
which was written with Python's json.dump method. Now I want to read this file into a DataFrame in Spark, using PySpark. Following the documentation, I'm doing this:
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlc = SQLContext(sc)
df = sqlc.read.json('my_file.json')
print df.show()
The print statement spits out this though:
+---------------+
|_corrupt_record|
+---------------+
|              {|
|        "a": 1,|
|         "b": 2|
|              }|
+---------------+
Does anyone know what's going on and why the file isn't being interpreted correctly?
If you want to leave your JSON file as it is (without stripping the newline characters \n), include the multiLine=True keyword argument:

sc = SparkContext()
sqlc = SQLContext(sc)
df = sqlc.read.json('my_file.json', multiLine=True)
df.show()
By default you need one JSON object per line in your input file (the "JSON Lines" format); see http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.json
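If you'd rather fix the file than pass multiLine=True, you can rewrite it in JSON Lines form with the standard json module. A minimal sketch (the file names and the sample object recreating the question's file are hypothetical):

```python
import json

# Recreate the kind of file the question describes: a single object
# written with json.dump, possibly spanning several lines.
with open('my_file.json', 'w') as f:
    json.dump({"a": 1, "b": 2}, f, indent=2)

# Load the whole file back as one Python object.
with open('my_file.json') as f:
    data = json.load(f)

# Spark's default reader expects one compact JSON object per line,
# so wrap a single object in a list and write each object on its own line.
records = data if isinstance(data, list) else [data]

with open('my_file.jsonl', 'w') as f:
    for record in records:
        f.write(json.dumps(record) + '\n')
```

The resulting my_file.jsonl can then be read with sqlc.read.json without any extra arguments.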
If your JSON file looks like this, it will give you the expected DataFrame:
{ "a": 1, "b": 2 }
{ "a": 3, "b": 4 }
....

df.show()
+---+---+
|  a|  b|
+---+---+
|  1|  2|
|  3|  4|
+---+---+