I have a Kinesis Firehose delivery stream that puts data into S3. However, in the data file the JSON objects have no separator between them, so it looks something like this:
{
  "key1" : "value1",
  "key2" : "value2"
}{
  "key1" : "value1",
  "key2" : "value2"
}
In Apache Spark I am doing this to read the data file:
df = spark.read.schema(schema).json(path, multiLine=True)
This reads only the first JSON object in the file; the rest are ignored because there is no separator.
How can I resolve this issue in Spark?
You can use sparkContext's wholeTextFiles API to read the JSON file into a Tuple2(filename, whole text), split the whole text into individual JSON strings, and then finally read them as JSON into a dataframe.
spark\
    .read\
    .json(spark.sparkContext
          .wholeTextFiles("path to your multiline json file")
          .values()
          .flatMap(lambda x: x
                   .strip()
                   .replace("\n", "#!#")
                   .replace("{#!# ", "{")
                   .replace("#!#}", "}")
                   .replace(",#!#", ",")
                   .replace("}{", "}#!#{")
                   .split("#!#")))\
    .show()
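The splitting logic can be checked locally without a Spark cluster. This is a minimal sketch of the same replace chain, with a `}{` boundary marker added (an assumption for the case where objects sit back to back on one line, as in the sample above):

```python
import json

# Sample file contents: two JSON objects with no separator between them
raw = '''{
  "key1" : "value1",
  "key2" : "value2"
}{
  "key1" : "value1",
  "key2" : "value2"
}'''

def split_concatenated(text, marker="#!#"):
    # Mark newlines, strip the markers that fall inside an object,
    # then mark each "}{" object boundary and split on the marker.
    s = (text.strip()
             .replace("\n", marker)
             .replace("{" + marker + " ", "{")
             .replace(marker + "}", "}")
             .replace("," + marker, ",")
             .replace("}{", "}" + marker + "{"))
    return [part for part in s.split(marker) if part]

parts = split_concatenated(raw)
print(len(parts))                    # 2
print(json.loads(parts[0])["key1"])  # value1
```

Each element of the returned list is a complete single-line JSON string, which is exactly what the flatMap hands to the JSON reader.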
You should get a dataframe like:
+------+------+
|  key1|  key2|
+------+------+
|value1|value2|
|value1|value2|
+------+------+
You can modify the code according to your needs.
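Note that the string-replace approach can break if a value contains `}{` or braces inside a string. A more robust alternative (a sketch, using the standard library's `json.JSONDecoder.raw_decode` to walk the concatenated text) could replace the lambda:

```python
import json

def split_json_objects(text):
    """Yield each top-level JSON object from concatenated JSON text."""
    decoder = json.JSONDecoder()
    idx = 0
    while idx < len(text):
        # Skip whitespace between objects
        while idx < len(text) and text[idx].isspace():
            idx += 1
        if idx >= len(text):
            break
        # raw_decode parses one object starting at idx and
        # returns the object plus the index where it ended
        obj, idx = decoder.raw_decode(text, idx)
        yield json.dumps(obj)  # re-serialize as a single-line JSON string

# Works even when a string value contains "}{"
sample = '{"a": "}{"}{"a": 2}'
print(list(split_json_objects(sample)))  # ['{"a": "}{"}', '{"a": 2}']
```

In the Spark pipeline this function can be passed to flatMap in place of the replace/split lambda, since it also produces one single-line JSON string per object.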