I have a Kinesis Firehose delivery stream that puts data to S3. However, in the data file the JSON objects have no separator between them, so the file looks something like this:
{
"key1" : "value1",
"key2" : "value2"
}{
"key1" : "value1",
"key2" : "value2"
}
In Apache Spark I am reading the data file like this:
df = spark.read.schema(schema).json(path, multiLine=True)
This reads only the first JSON object in the file; the rest are ignored because there is no separator. How can I resolve this issue in Spark?
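For what it's worth, this isn't Spark-specific — even Python's standard json module rejects concatenated objects like these:

```python
import json

# Two JSON objects written back-to-back, as in the Firehose output above
raw = '{\n"key1" : "value1",\n"key2" : "value2"\n}{\n"key1" : "value1",\n"key2" : "value2"\n}'

try:
    json.loads(raw)
    parsed = True
except json.JSONDecodeError:
    parsed = False

print(parsed)  # False: the second "{" counts as extra data after the first object
```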
You can use sparkContext's wholeTextFiles API to read each JSON file into a Tuple2(filename, whole text), transform the whole text into one string per JSON object, and then use sqlContext to read the result as JSON into a dataframe.
sqlContext \
    .read \
    .json(sc
          .wholeTextFiles("path to your multiline json file")
          .values()
          .flatMap(lambda x: x
                   .replace("\n", "#!#")      # mark every line break
                   .replace("{#!#", "{")      # drop marker after an opening brace
                   .replace("#!#}", "}")      # drop marker before a closing brace
                   .replace(",#!#", ",")      # drop marker after a comma
                   .replace("}{", "}#!#{")    # re-insert a marker between adjacent objects
                   .split("#!#"))) \
    .show()
You should get a dataframe like this:
+------+------+
| key1| key2|
+------+------+
|value1|value2|
|value1|value2|
+------+------+
You can modify the code to suit your needs.
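The marker trick in the flatMap can be checked locally without Spark. A minimal pure-Python sketch (the helper name split_concatenated_json and the #!# marker are illustrative, not part of any API):

```python
import json

def split_concatenated_json(text, marker="#!#"):
    """Split a string of back-to-back pretty-printed JSON objects
    into one compact JSON string per object."""
    flat = text.replace("\n", marker)              # mark every line break
    flat = (flat.replace("{" + marker, "{")        # drop marker after an opening brace
                .replace(marker + "}", "}")        # drop marker before a closing brace
                .replace("," + marker, ","))       # drop marker after a comma
    flat = flat.replace("}{", "}" + marker + "{")  # separate adjacent objects
    return flat.split(marker)

# Same shape as the Firehose file in the question
raw = '{\n"key1" : "value1",\n"key2" : "value2"\n}{\n"key1" : "value1",\n"key2" : "value2"\n}'
records = split_concatenated_json(raw)
print(len(records))                    # 2
print(json.loads(records[0])["key1"])  # value1
```

Each element of the returned list is a valid standalone JSON document, which is exactly what the dataframe reader expects from the flatMap output.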