Pandas has the pandas.io.json.json_normalize method that can flatten JSON.
I've got a source file that contains JSON-per-line data (streamed to the file by a long-running process). I'm not really in a position to modify what is written to that file. Here's a contrived example of the JSON:
{"type": "bar", "aspect": {"Positive": 1, "Negative": 0.6}}
{"type": "bar", "aspect": {"Positive": 0.6, "Negative": 1.5}}
I can read it in using the normal pandas.read_json method by passing the lines=True parameter. However, I'd like it to be flattened, as if by json_normalize, as that gets it into a really useful form, e.g.
>>> json_normalize(json.loads('{"type": "bar", "aspect": {"Positive": 1, "Negative": 0.6}}'))
aspect.Negative aspect.Positive type
0 0.6 1 bar
If I loop through the source, normalize and append, that's going to result in a full copy for each line I add. That's going to really hurt performance.
You can use read_json + DataFrame constructor + add_prefix + drop + join:
df = pd.read_json('file.json', lines = True)
print (df)
aspect type
0 {'Negative': 0.6000000000000001, 'Positive': 1} bar
1 {'Negative': 1.5, 'Positive': 0.6000000000000001} bar
df = (pd.DataFrame(df['aspect'].values.tolist())
.add_prefix('aspect.')
.join(df.drop(columns='aspect')))
print (df)
aspect.Negative aspect.Positive type
0 0.6 1.0 bar
1 1.5 0.6 bar
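The steps above can be sketched as a self-contained example. This is only an illustration: the helper name `flatten_column` is hypothetical, the sample data is read from an in-memory `io.StringIO` buffer rather than from `file.json`, and `drop(columns=...)` is used because the positional `axis` argument is no longer accepted by recent pandas releases.

```python
import io

import pandas as pd

# The two sample records from the question, one JSON object per line.
raw = (
    '{"type": "bar", "aspect": {"Positive": 1, "Negative": 0.6}}\n'
    '{"type": "bar", "aspect": {"Positive": 0.6, "Negative": 1.5}}\n'
)


def flatten_column(df, column):
    """Expand a column of dicts into prefixed columns (hypothetical helper)."""
    # Build a frame from the list of dicts, keeping the original row index.
    expanded = pd.DataFrame(df[column].tolist(), index=df.index)
    # Prefix the new columns and re-attach the remaining original columns.
    return expanded.add_prefix(column + '.').join(df.drop(columns=column))


# lines=True tells read_json to parse one JSON document per line.
df = pd.read_json(io.StringIO(raw), lines=True)
flat = flatten_column(df, 'aspect')
print(flat)
```

This only flattens one level of one column; deeper nesting would need the helper applied repeatedly, or json_normalize.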
Alternatively, call json.loads on each line and then pass the result to json_normalize:
df = json_normalize(pd.Series(open('file.json').readlines()).apply(json.loads))
print (df)
aspect.Negative aspect.Positive type
0 0.6 1.0 bar
1 1.5 0.6 bar
Or the same with a list comprehension instead of a Series:
df = json_normalize([json.loads(x) for x in open('file.json').readlines()])
print (df)
aspect.Negative aspect.Positive type
0 0.6 1.0 bar
1 1.5 0.6 bar
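For reference, here is a minimal sketch of the json.loads route written against pandas 1.0 or later, where the function is also exposed at the top level as `pd.json_normalize`. The temporary file stands in for `file.json` and is an assumption made only to keep the example self-contained; the `with` block also closes the file handle, which the one-liners above leave open.

```python
import json
import os
import tempfile

import pandas as pd

lines = [
    '{"type": "bar", "aspect": {"Positive": 1, "Negative": 0.6}}',
    '{"type": "bar", "aspect": {"Positive": 0.6, "Negative": 1.5}}',
]

# Write the sample to a temporary file (a stand-in for file.json).
with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False) as tmp:
    tmp.write('\n'.join(lines) + '\n')
    path = tmp.name

try:
    # Parse one JSON document per non-empty line.
    with open(path) as fh:
        records = [json.loads(line) for line in fh if line.strip()]
finally:
    os.remove(path)

# Normalize the whole list in one call, so the frame is built only once
# instead of being copied for every appended line.
df_norm = pd.json_normalize(records)
print(df_norm)
```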
Timings on 5k rows:
In [13]: %timeit json_normalize([json.loads(x) for x in open('file.json').readlines()])
10 loops, best of 3: 112 ms per loop
In [14]: %timeit json_normalize(pd.Series(open('file.json').readlines()).apply(json.loads))
10 loops, best of 3: 117 ms per loop
In [15]: %%timeit
...: df = pd.read_json('file.json', lines = True)
...: df = (pd.DataFrame(df['aspect'].values.tolist()).add_prefix('aspect.').join(df.drop(columns='aspect')))
...:
10 loops, best of 3: 30.1 ms per loop
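To repeat the comparison outside IPython, something like the following sketch could be used. The generated data, the function names `via_json_normalize` and `via_read_json`, and the use of an in-memory string instead of a file are all assumptions, so absolute numbers will differ from the ones above and depend on the pandas version and the machine.

```python
import io
import json
import timeit

import pandas as pd

N_ROWS = 5000  # same row count as the timings above

# Synthetic JSON-per-line data with the same shape as the question's sample.
raw = '\n'.join(
    json.dumps({"type": "bar", "aspect": {"Positive": i % 7, "Negative": 0.5 * i}})
    for i in range(N_ROWS)
)


def via_json_normalize():
    # Parse every line with json.loads, then flatten in one call.
    return pd.json_normalize([json.loads(line) for line in raw.splitlines()])


def via_read_json():
    # Let pandas parse the lines, then expand the dict column manually.
    df = pd.read_json(io.StringIO(raw), lines=True)
    expanded = pd.DataFrame(df['aspect'].tolist(), index=df.index)
    return expanded.add_prefix('aspect.').join(df.drop(columns='aspect'))


for func in (via_json_normalize, via_read_json):
    # Best of three single runs, reported in milliseconds.
    seconds = min(timeit.repeat(func, number=1, repeat=3))
    print(f'{func.__name__}: {seconds * 1000:.1f} ms')
```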