
Pandas, normalising json-per-line

Tags:

python

pandas

Pandas has the pandas.io.json.json_normalize method, which can flatten nested JSON.

I've got a source file that contains JSON-per-line data (streamed to the file by a long-running process). I'm not really in a position to modify what is written to that file. Here's a contrived example of the JSON:

{"type": "bar", "aspect": {"Positive": 1, "Negative": 0.6}}
{"type": "bar", "aspect": {"Positive": 0.6, "Negative": 1.5}}

I can read it in using the normal pandas.read_json method by passing the lines=True parameter. However, I'd like it to be flattened, as if by json_normalize, because that gets it into a really useful form, e.g.

>>> json_normalize(json.loads('{"type": "bar", "aspect": {"Positive": 1, "Negative": 0.6}}'))
   aspect.Negative  aspect.Positive type
0              0.6                1  bar

If I loop through the source, normalize each line and append it to a DataFrame, every append makes a full copy of the data accumulated so far. That's going to really hurt performance.

Asked by Twirrim, Sep 29 '17 04:09



1 Answer

You can use read_json + DataFrame constructor + add_prefix + drop + join:

import pandas as pd

# Read one JSON object per line; the nested 'aspect' objects stay as dicts.
df = pd.read_json('file.json', lines=True)
print (df)
                                              aspect type
0    {'Negative': 0.6000000000000001, 'Positive': 1}  bar
1  {'Negative': 1.5, 'Positive': 0.6000000000000001}  bar

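# Expand the dicts in 'aspect' into columns, prefix them, then rejoin the rest.
# drop() takes columns= by keyword; the positional axis argument no longer works in pandas 2.0.
df = (pd.DataFrame(df['aspect'].values.tolist())
        .add_prefix('aspect.')
        .join(df.drop(columns='aspect')))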
print (df)
   aspect.Negative  aspect.Positive type
0              0.6              1.0  bar
1              1.5              0.6  bar
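
As a self-contained check of the approach above, here is a minimal sketch that runs against the two sample records from the question. It assumes an in-memory io.StringIO buffer in place of file.json, and it passes index=df.index so the join stays aligned even if the frame does not have a default index. The column order printed may differ between pandas versions.

```python
import io

import pandas as pd

# Two sample records from the question, one JSON object per line.
raw = (
    '{"type": "bar", "aspect": {"Positive": 1, "Negative": 0.6}}\n'
    '{"type": "bar", "aspect": {"Positive": 0.6, "Negative": 1.5}}\n'
)

# StringIO stands in for the real file so the sketch needs nothing on disk.
df = pd.read_json(io.StringIO(raw), lines=True)

# Expand the column of dicts into its own frame, keeping the original index
# so the join below lines the rows up correctly.
aspect = pd.DataFrame(df["aspect"].tolist(), index=df.index).add_prefix("aspect.")

# Drop the nested column by keyword and attach the flattened columns.
flat = aspect.join(df.drop(columns="aspect"))
print(flat)
```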

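Alternatively, call json.loads on each line and then pass the results to json_normalize in a single call:

import json
from pandas import json_normalize  # pandas >= 1.0

df = json_normalize(pd.Series(open('file.json').readlines()).apply(json.loads))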
print (df)
   aspect.Negative  aspect.Positive type
0              0.6              1.0  bar
1              1.5              0.6  bar

df = json_normalize([json.loads(x) for x in open('file.json').readlines()])
print (df)

   aspect.Negative  aspect.Positive type
0              0.6              1.0  bar
1              1.5              0.6  bar
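
The same idea can be sketched without readlines(), iterating over the file object so the raw text is never held in memory all at once. This is only an illustration under a few assumptions: normalize_json_lines is a hypothetical helper name, blank lines are skipped, and pd.json_normalize is the top-level name available from pandas 1.0.

```python
import io
import json

import pandas as pd


def normalize_json_lines(lines):
    """Parse an iterable of JSON lines and flatten them in one call.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    # Parse every non-blank line first. Building a list of dicts is cheap
    # compared with repeatedly copying a growing DataFrame.
    records = [json.loads(line) for line in lines if line.strip()]
    # A single json_normalize call builds the flattened frame once.
    return pd.json_normalize(records)


# StringIO stands in for the real file; with a file on disk this would be
#   with open("file.json") as fh:
#       flat = normalize_json_lines(fh)
sample = io.StringIO(
    '{"type": "bar", "aspect": {"Positive": 1, "Negative": 0.6}}\n'
    '\n'
    '{"type": "bar", "aspect": {"Positive": 0.6, "Negative": 1.5}}\n'
)
flat = normalize_json_lines(sample)
print(flat)
```

Using a with block around open() also closes the file handle promptly, which the one-liners above leave to the garbage collector.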

Timings on a file of 5k rows:

In [13]: %timeit json_normalize([json.loads(x) for x in open('file.json').readlines()])
10 loops, best of 3: 112 ms per loop

In [14]: %timeit json_normalize(pd.Series(open('file.json').readlines()).apply(json.loads))
10 loops, best of 3: 117 ms per loop

In [15]: %%timeit
    ...: df = pd.read_json('file.json', lines = True)
    ...: df = (pd.DataFrame(df['aspect'].values.tolist()).add_prefix('aspect.').join(df.drop(columns='aspect')))
    ...: 
10 loops, best of 3: 30.1 ms per loop
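
To rerun a comparison like this outside IPython, something along these lines should work. It is a rough sketch, not the setup used for the timings above: the 5,000-line test file is an assumption (the two sample records repeated), the function names are made up for the example, and absolute numbers will vary with hardware and pandas version.

```python
import json
import tempfile
import timeit

import pandas as pd

# Assumed test data: the two sample records repeated to give 5,000 lines.
lines = [
    '{"type": "bar", "aspect": {"Positive": 1, "Negative": 0.6}}',
    '{"type": "bar", "aspect": {"Positive": 0.6, "Negative": 1.5}}',
] * 2500

# Write the lines to a temporary file. It is left on disk afterwards;
# delete it with os.remove(path) when finished.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    tmp.write("\n".join(lines) + "\n")
    path = tmp.name


def via_json_normalize():
    # Parse each line with json.loads, then flatten everything in one call.
    with open(path) as fh:
        return pd.json_normalize([json.loads(line) for line in fh])


def via_read_json():
    # Let read_json parse the file, then expand the 'aspect' dicts by hand.
    df = pd.read_json(path, lines=True)
    aspect = pd.DataFrame(df["aspect"].tolist(), index=df.index).add_prefix("aspect.")
    return aspect.join(df.drop(columns="aspect"))


# Report the best of three batches of five runs, in milliseconds per call.
for func in (via_json_normalize, via_read_json):
    best = min(timeit.repeat(func, number=5, repeat=3)) / 5
    print(f"{func.__name__}: {best * 1000:.1f} ms per call")
```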
Answered by jezrael, Oct 02 '22 22:10