I'd like to know if there is a memory efficient way of reading multi record JSON file ( each line is a JSON dict) into a pandas dataframe. Below is a 2 line example with working solution, I need it for potentially very large number of records. Example use would be to process output from Hadoop Pig JSonStorage function.
import json import pandas as pd test='''{"a":1,"b":2} {"a":3,"b":4}''' #df=pd.read_json(test,orient='records') doesn't work, expects [] l=[ json.loads(l) for l in test.splitlines()] df=pd.DataFrame(l)
Step 1: Load the nested json file with the help of json. load() method. Step 2: Flatten the different column values using pandas methods. Step 3: Convert the flattened dataframe into CSV file.
Reading JSON Files using Pandas To read the files, we use read_json() function and through it, we pass the path to the JSON file we want to read. Once we do that, it returns a “DataFrame”( A table of rows and columns) that stores data.
pandas filesystem APIs make it easy to load multiple files stored in a single directory or in nested directories. Other Python libraries can even make this easier and more scalable.
Note: Line separated json is now supported in read_json
(since 0.19.0):
In [31]: pd.read_json('{"a":1,"b":2}\n{"a":3,"b":4}', lines=True) Out[31]: a b 0 1 2 1 3 4
or with a file/filepath rather than a json string:
pd.read_json(json_file, lines=True)
It's going to depend on the size of you DataFrames which is faster, but another option is to use str.join
to smash your multi line "JSON" (Note: it's not valid json), into valid json and use read_json:
In [11]: '[%s]' % ','.join(test.splitlines()) Out[11]: '[{"a":1,"b":2},{"a":3,"b":4}]'
For this tiny example this is slower, if around 100 it's the similar, signicant gains if it's larger...
In [21]: %timeit pd.read_json('[%s]' % ','.join(test.splitlines())) 1000 loops, best of 3: 977 µs per loop In [22]: %timeit l=[ json.loads(l) for l in test.splitlines()]; df = pd.DataFrame(l) 1000 loops, best of 3: 282 µs per loop In [23]: test_100 = '\n'.join([test] * 100) In [24]: %timeit pd.read_json('[%s]' % ','.join(test_100.splitlines())) 1000 loops, best of 3: 1.25 ms per loop In [25]: %timeit l = [json.loads(l) for l in test_100.splitlines()]; df = pd.DataFrame(l) 1000 loops, best of 3: 1.25 ms per loop In [26]: test_1000 = '\n'.join([test] * 1000) In [27]: %timeit l = [json.loads(l) for l in test_1000.splitlines()]; df = pd.DataFrame(l) 100 loops, best of 3: 9.78 ms per loop In [28]: %timeit pd.read_json('[%s]' % ','.join(test_1000.splitlines())) 100 loops, best of 3: 3.36 ms per loop
Note: of that time the join is surprisingly fast.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With