_corrupt_record error when reading a JSON file into Spark

I've got this JSON file:

{
    "a": 1,
    "b": 2
}

which was produced with Python's json.dump method. Now I want to read this file into a DataFrame in Spark, using pyspark. Following the documentation, I'm doing this:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlc = SQLContext(sc)
df = sqlc.read.json('my_file.json')
print df.show()

The print statement spits out this though:

+---------------+
|_corrupt_record|
+---------------+
|              {|
|       "a": 1, |
|         "b": 2|
|              }|
+---------------+

Does anyone know what's going on, and why Spark is not interpreting the file correctly?

asked Feb 15 '16 by mar tin


2 Answers

If you want to leave your JSON file as it is (without stripping the newline characters \n), include the multiLine=True keyword argument:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlc = SQLContext(sc)
df = sqlc.read.json('my_file.json', multiLine=True)
print df.show()
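For newer PySpark versions (2.x and later), SparkSession is the usual entry point. Here is a minimal sketch of the same read under that API, assuming Spark 2.2+ where multiLine was introduced; the app name is illustrative:

from pyspark.sql import SparkSession

# SparkSession is the unified entry point in Spark 2.x+; the app name is arbitrary.
spark = SparkSession.builder.appName("read-json-example").getOrCreate()

# multiLine=True lets the reader parse a single JSON document spanning several lines.
df = spark.read.json('my_file.json', multiLine=True)
df.show()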
answered by wiggy


You need to have one JSON object per line in your input file; see http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.json

If your JSON file looks like this, it will give you the expected DataFrame:

{ "a": 1, "b": 2 } { "a": 3, "b": 4 }  .... df.show() +---+---+ |  a|  b| +---+---+ |  1|  2| |  3|  4| +---+---+ 
answered by Bernhard