
How to read invalid JSON format from Amazon Firehose

I've got a horrible scenario in which I want to read the files that Kinesis Firehose creates on our S3.

Kinesis Firehose creates files that don't put every JSON object on a new line; the objects are simply concatenated one after another.

{"param1":"value1","param2":numericvalue2,"param3":"nested {bracket}"}{"param1":"value1","param2":numericvalue2,"param3":"nested {bracket}"}{"param1":"value1","param2":numericvalue2,"param3":"nested {bracket}"}

This scenario is not supported by the normal JSON.parse, and I have tried working with the following regex: .scan(/({((\".?\":.?)*?)})/)

But the scan only seems to work in scenarios without nested brackets.

Does anybody know a working, better, or more elegant way to solve this problem?

Spons asked Oct 18 '25 22:10


2 Answers

The regex in the initial answer is for unquoted JSON, which happens sometimes. This one:

({((\\?\".*?\\?\")*?)})

works for both quoted and unquoted JSON.

Besides this, I improved it a bit to keep it simpler, since you can have integer and normal values. Anything within string literals is ignored due to the double capturing group.

https://regex101.com/r/kPSc0i/1
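
As a rough sketch of how the pattern above might be used from Ruby, the snippet below runs it through scan and parses each match. The braces are escaped here for clarity, and the sample input and variable names are made up for illustration. Note that the pattern appears to expect a closing quote right before the final brace, so objects whose last value is a number may not be matched on their own; parsing every match is a cheap way to catch that.

```ruby
require "json"

# Pattern from the answer above, with the literal braces escaped.
object_pattern = /(\{((\\?".*?\\?")*?)\})/

# Illustrative sample, not taken from a real Firehose file.
input = '{"id":1,"note":"nested {bracket}"}{"id":2,"note":"plain"}'

# scan returns one array of captures per match; the first capture is the whole object.
records = input.scan(object_pattern).map do |captures|
  JSON.parse(captures.first) # raises JSON::ParserError if a match is not valid JSON
end
```

If JSON.parse raises here, the regex has most likely split or merged objects incorrectly for that input.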

Spons answered Oct 20 '25 11:10


Modify the input to be one large JSON array, then parse that:

require "json"
input = File.read("input.json")
json = "[#{input.rstrip.gsub(/\}\s*\{/, '},{')}]"
data = JSON.parse(json)

You might want to combine the first two lines to save some memory:

json = "[#{File.read('input.json').rstrip.gsub(/\}\s*\{/, '},{')}]"
data = JSON.parse(json)

This assumes that } followed by some whitespace followed by { never occurs inside a key or value in your JSON encoded data.
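
If that assumption might not hold, one alternative is to split the input by tracking brace depth while skipping over string literals. The following is only a minimal sketch: split_concatenated_json is a hypothetical helper name, the sample string is invented, and it assumes every top-level value is a JSON object.

```ruby
require "json"

# Hypothetical helper: splits concatenated JSON objects by tracking brace
# depth, ignoring any braces that appear inside string literals.
def split_concatenated_json(input)
  objects = []
  depth = 0
  in_string = false
  escaped = false
  start = nil

  input.each_char.with_index do |char, index|
    if in_string
      if escaped
        escaped = false          # character after a backslash is taken literally
      elsif char == "\\"
        escaped = true
      elsif char == '"'
        in_string = false        # closing quote ends the string literal
      end
      next
    end

    case char
    when '"'
      in_string = true
    when "{"
      start = index if depth.zero?   # remember where a top-level object begins
      depth += 1
    when "}"
      depth -= 1
      objects << input[start..index] if depth.zero?
    end
  end

  objects.map { |text| JSON.parse(text) }
end

# Illustrative input: the first value contains "} {" inside a string.
sample = '{"a":"x } { y","n":1}{"a":"z","nested":{"k":2}}'
records = split_concatenated_json(sample)
```

With this sample, the gsub approach would still produce parseable JSON but would silently rewrite the first value to "x },{ y", whereas the depth-tracking sketch leaves string contents untouched.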

Lars Haugseth answered Oct 20 '25 13:10