 

Pyspark: Reading JSON data file with no separator between objects

I have a Kinesis Firehose delivery stream that puts data into S3. However, in the data file the JSON objects have no separator between them, so the file looks like this:

{
  "key1" : "value1",
  "key2" : "value2"
}{
  "key1" : "value1",
  "key2" : "value2"
}

In Apache Spark I am reading the data file like this:

df = spark.read.schema(schema).json(path, multiLine=True)

This reads only the first JSON object in the file; the rest are ignored because there is no separator.

How can I resolve this issue in Spark?

sjishan asked Mar 07 '23 04:03

1 Answer

You can use sparkContext's wholeTextFiles API to read each JSON file into a Tuple2 of (filename, whole text), split the whole text into individual JSON strings, and then use sqlContext to read them as JSON into a dataframe.

sqlContext\
    .read\
    .json(sc
          .wholeTextFiles("path to your multiline json file")  # RDD of (filename, contents)
          .values()                                            # keep only the file contents
          .flatMap(lambda x: x
                   .replace("\n", "#!#")    # mark every newline with a sentinel token
                   .replace("{#!# ", "{")   # remove sentinels that fall inside an object
                   .replace("#!#}", "}")
                   .replace(",#!#", ",")
                   # only the sentinels between objects remain, so split there
                   .split("#!#")))\
    .show()

You should get a dataframe like:

+------+------+
|  key1|  key2|
+------+------+
|value1|value2|
|value1|value2|
+------+------+

You can adapt the replacement tokens to match the exact formatting of your data.
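If the string replacements are too fragile for your data (e.g. nested objects or values containing braces), a more robust alternative, not from the original answer, is to split the concatenated objects with Python's `json.JSONDecoder.raw_decode`, which parses one object at a time and reports where it ends. A minimal sketch of such a splitter (the function name `split_concatenated_json` is my own):

```python
import json

def split_concatenated_json(text):
    """Split a string of back-to-back JSON objects into a list of JSON strings."""
    decoder = json.JSONDecoder()
    objs = []
    idx = 0
    text = text.strip()
    while idx < len(text):
        # raw_decode parses one object starting at idx and returns (object, end_index)
        obj, end = decoder.raw_decode(text, idx)
        objs.append(json.dumps(obj))
        # skip any whitespace between consecutive objects
        while end < len(text) and text[end].isspace():
            end += 1
        idx = end
    return objs

sample = '{\n  "key1" : "value1",\n  "key2" : "value2"\n}{\n  "key1" : "value1",\n  "key2" : "value2"\n}'
print(split_concatenated_json(sample))
```

You could then plug this into the same pipeline, e.g. `sqlContext.read.json(sc.wholeTextFiles(path).values().flatMap(split_concatenated_json))`, and it would work regardless of whitespace or nesting inside the objects.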

Ramesh Maharjan answered Mar 10 '23 08:03