I have a Kinesis Firehose delivery stream that puts data into S3. However, in the data file the JSON objects have no separator between them, so it looks something like this:
{
  "key1" : "value1",
  "key2" : "value2"
}{
  "key1" : "value1",
  "key2" : "value2"
}
In Apache Spark I am doing this to read the data file:
df = spark.read.schema(schema).json(path, multiLine=True)
This reads only the first JSON object in the file; the rest are ignored because there is no separator.
How can I resolve this issue in Spark?
You can use sparkContext's wholeTextFiles API to read the JSON file into a Tuple2(filename, whole text), split the whole text into individual JSON strings, and then finally read them as JSON into a dataframe.
spark\
    .read\
    .json(spark.sparkContext
          .wholeTextFiles("path to your multiline json file")
          .values()
          .flatMap(lambda x: x
                   .strip()
                   .replace("\n", "#!#")
                   .replace("{#!# ", "{")
                   .replace("#!#}", "}")
                   .replace(",#!#", ",")
                   .replace("}{", "}#!#{")
                   .split("#!#")))\
    .show()
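The splitting logic can be checked locally without a Spark cluster. This is a minimal sketch of the same replace chain, with a `}{` boundary marker added (an assumption for the case where objects sit back to back on one line, as in the sample above):

```python
import json

# Sample file contents: two JSON objects with no separator between them
raw = '''{
  "key1" : "value1",
  "key2" : "value2"
}{
  "key1" : "value1",
  "key2" : "value2"
}'''

def split_concatenated(text, marker="#!#"):
    # Mark newlines, strip the markers that fall inside an object,
    # then mark each "}{" object boundary and split on the marker.
    s = (text.strip()
             .replace("\n", marker)
             .replace("{" + marker + " ", "{")
             .replace(marker + "}", "}")
             .replace("," + marker, ",")
             .replace("}{", "}" + marker + "{"))
    return [part for part in s.split(marker) if part]

parts = split_concatenated(raw)
print(len(parts))                    # 2
print(json.loads(parts[0])["key1"])  # value1
```

Each element of the returned list is a complete single-line JSON string, which is exactly what the flatMap hands to the JSON reader.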
You should get a dataframe like:
+------+------+
|  key1|  key2|
+------+------+
|value1|value2|
|value1|value2|
+------+------+
You can modify the code according to your needs.
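Note that the string-replace approach can break if a value contains `}{` or braces inside a string. A more robust alternative (a sketch, using the standard library's `json.JSONDecoder.raw_decode` to walk the concatenated text) could replace the lambda:

```python
import json

def split_json_objects(text):
    """Yield each top-level JSON object from concatenated JSON text."""
    decoder = json.JSONDecoder()
    idx = 0
    while idx < len(text):
        # Skip whitespace between objects
        while idx < len(text) and text[idx].isspace():
            idx += 1
        if idx >= len(text):
            break
        # raw_decode parses one object starting at idx and
        # returns the object plus the index where it ended
        obj, idx = decoder.raw_decode(text, idx)
        yield json.dumps(obj)  # re-serialize as a single-line JSON string

# Works even when a string value contains "}{"
sample = '{"a": "}{"}{"a": 2}'
print(list(split_json_objects(sample)))  # ['{"a": "}{"}', '{"a": 2}']
```

In the Spark pipeline this function can be passed to flatMap in place of the replace/split lambda, since it also produces one single-line JSON string per object.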