Using pyspark, how do I read multiple JSON documents on a single line in a file into a dataframe?

Using Spark 2.3, I know I can read a file of JSON documents like this:

{'key': 'val1'}
{'key': 'val2'}

With this:

spark.read.json('filename')

How can I read the following into a dataframe when there aren't newlines between the JSON documents?

The following would be an example input.

{'key': 'val1'}{'key': 'val2'}

To be clear, I expect a dataframe with two rows (frame.count() == 2).

asked Jul 12 '18 by Jared

1 Answer

Please try -

df = spark.read.json(["fileName1","fileName2"])

If you want to read all the JSON files in a folder, you can also do:

df = spark.read.json("data/*json")

answered Sep 27 '22 by Tom Ron
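
The snippets above read multiple files. For the case in the question, where several JSON objects sit on a single line with no newlines between them, one option is to read the file as plain text, split the concatenated objects, and pass the resulting strings back to spark.read.json. A minimal sketch, assuming a hypothetical file name concatenated.json and that the sequence }{ never occurs inside a string value:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Each element of the RDD is one raw line, e.g. {'key': 'val1'}{'key': 'val2'}
raw = sc.textFile("concatenated.json")  # hypothetical file name

# Insert a newline between back-to-back objects, then split each line
# into individual JSON documents.
docs = raw.flatMap(lambda line: line.replace("}{", "}\n{").split("\n"))

# In Spark 2.x, spark.read.json also accepts an RDD of JSON strings.
# The single-quoted keys/values parse because allowSingleQuotes defaults to true.
df = spark.read.json(docs)
df.count()  # == 2

Splitting on the literal "}{" is fragile if the objects are separated by whitespace or if that character sequence can appear inside string data; a regular-expression split or a streaming JSON parser would be more robust.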