Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Retrieving JSON objects from a text file (using Python)

I have thousands of text files containing multiple JSON objects, but unfortunately there is no delimiter between the objects. Objects are stored as dictionaries and some of their fields are themselves objects. Each object might have a variable number of nested objects. Concretely, an object might look like this:

{field1: {}, field2: "some value", field3: {}, ...} 

and hundreds of such objects are concatenated without a delimiter in a text file. This means that I can neither use json.load() nor json.loads().

Any suggestion on how I can solve this problem. Is there a known parser to do this?

like image 982
Lejlek Avatar asked Jan 04 '12 16:01

Lejlek


People also ask

How do you access a JSON object in Python?

It's pretty easy to load a JSON object in Python. Python has a built-in package called json, which can be used to work with JSON data. It's done by using the JSON module, which provides us with a lot of methods which among loads() and load() methods are gonna help us to read the JSON file.

How do I create a JSON file from a text file?

Open a Text editor like Notepad, Visual Studio Code, Sublime, or your favorite one. Copy and Paste below JSON data in Text Editor or create your own base on the What is JSON article. Once file data are validated, save the file with the extension of . json, and now you know how to create the Valid JSON document.


2 Answers

This decodes your "list" of JSON Objects from a string:

from json import JSONDecoder

def loads_invalid_obj_list(s):
    decoder = JSONDecoder()
    s_len = len(s)

    objs = []
    end = 0
    while end != s_len:
        obj, end = decoder.raw_decode(s, idx=end)
        objs.append(obj)

    return objs

The bonus here is that you play nice with the parser. Hence it keeps telling you exactly where it found an error.

Examples

>>> loads_invalid_obj_list('{}{}')
[{}, {}]

>>> loads_invalid_obj_list('{}{\n}{')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "decode.py", line 9, in loads_invalid_obj_list
    obj, end = decoder.raw_decode(s, idx=end)
  File     "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 376, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Expecting object: line 2 column 2 (char 5)

Clean Solution (added later)

import json
import re

#shameless copy paste from json/decoder.py
FLAGS = re.VERBOSE | re.MULTILINE | re.DOTALL
WHITESPACE = re.compile(r'[ \t\n\r]*', FLAGS)

class ConcatJSONDecoder(json.JSONDecoder):
    def decode(self, s, _w=WHITESPACE.match):
        s_len = len(s)

        objs = []
        end = 0
        while end != s_len:
            obj, end = self.raw_decode(s, idx=_w(s, end).end())
            end = _w(s, end).end()
            objs.append(obj)
        return objs

Examples

>>> print json.loads('{}', cls=ConcatJSONDecoder)
[{}]

>>> print json.load(open('file'), cls=ConcatJSONDecoder)
[{}]

>>> print json.loads('{}{} {', cls=ConcatJSONDecoder)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 339, in loads
    return cls(encoding=encoding, **kw).decode(s)
  File "decode.py", line 15, in decode
    obj, end = self.raw_decode(s, idx=_w(s, end).end())
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 376, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Expecting object: line 1 column 5 (char 5)
like image 113
tback Avatar answered Oct 08 '22 11:10

tback


Sebastian Blask has the right idea, but there's no reason to use regexes for such a simple change.

objs = json.loads("[%s]"%(open('your_file.name').read().replace('}{', '},{')))

Or, more legibly

raw_objs_string = open('your_file.name').read() #read in raw data
raw_objs_string = raw_objs_string.replace('}{', '},{') #insert a comma between each object
objs_string = '[%s]'%(raw_objs_string) #wrap in a list, to make valid json
objs = json.loads(objs_string) #parse json
like image 30
Patrick Perini Avatar answered Oct 08 '22 09:10

Patrick Perini