Reading multiple JSON records into a Pandas dataframe

Tags:

I'd like to know if there is a memory efficient way of reading multi record JSON file ( each line is a JSON dict) into a pandas dataframe. Below is a 2 line example with working solution, I need it for potentially very large number of records. Example use would be to process output from Hadoop Pig JSonStorage function.

import json import pandas as pd  test='''{"a":1,"b":2} {"a":3,"b":4}''' #df=pd.read_json(test,orient='records') doesn't work, expects []  l=[ json.loads(l) for l in test.splitlines()] df=pd.DataFrame(l)

215

asked Nov 17 '13 23:11

seanv507

1 Answers

Note: Line separated json is now supported in read_json (since 0.19.0):

In [31]: pd.read_json('{"a":1,"b":2}\n{"a":3,"b":4}', lines=True) Out[31]:    a  b 0  1  2 1  3  4

or with a file/filepath rather than a json string:

pd.read_json(json_file, lines=True)

It's going to depend on the size of you DataFrames which is faster, but another option is to use str.join to smash your multi line "JSON" (Note: it's not valid json), into valid json and use read_json:

In [11]: '[%s]' % ','.join(test.splitlines()) Out[11]: '[{"a":1,"b":2},{"a":3,"b":4}]'

For this tiny example this is slower, if around 100 it's the similar, signicant gains if it's larger...

In [21]: %timeit pd.read_json('[%s]' % ','.join(test.splitlines())) 1000 loops, best of 3: 977 µs per loop  In [22]: %timeit l=[ json.loads(l) for l in test.splitlines()]; df = pd.DataFrame(l) 1000 loops, best of 3: 282 µs per loop  In [23]: test_100 = '\n'.join([test] * 100)  In [24]: %timeit pd.read_json('[%s]' % ','.join(test_100.splitlines())) 1000 loops, best of 3: 1.25 ms per loop  In [25]: %timeit l = [json.loads(l) for l in test_100.splitlines()]; df = pd.DataFrame(l) 1000 loops, best of 3: 1.25 ms per loop  In [26]: test_1000 = '\n'.join([test] * 1000)  In [27]: %timeit l = [json.loads(l) for l in test_1000.splitlines()]; df = pd.DataFrame(l) 100 loops, best of 3: 9.78 ms per loop  In [28]: %timeit pd.read_json('[%s]' % ','.join(test_1000.splitlines())) 100 loops, best of 3: 3.36 ms per loop

Note: of that time the join is surprisingly fast.

178

answered Sep 18 '22 12:09

Andy Hayden

Related questions
                            
                                Ruby equivalent for Python's "try"?
                            
                                Remove default apps from Django-admin
                            
                                Last element in OrderedDict
                            
                                What's the official way of storing settings for Python programs?
                            
                                Financial technical analysis in python [closed]
                            
                                How to plot time series in python
                            
                                Programmatically searching google in Python using custom search
                            
                                Python-like list comprehension in Java
                            
                                Plotting networkx graph with node labels defaulting to node name
                            
                                Python sum, why not strings? [closed]
                            
                                How to sort OrderedDict of OrderedDict?
                            
                                Access a function variable outside the function without using "global"
                            
                                How can I tell which python implementation I'm using?
                            
                                Pandas: Return Hour from Datetime Column Directly
                            
                                Writing an mp4 video using python opencv
                            
                                Get first letter of a string from column
                            
                                Python Try Catch Block inside lambda
                            
                                Webdriver Screenshot
                            
                                Get parent class name? [duplicate]
                            
                                How do I get a python program to do nothing?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Reading multiple JSON records into a Pandas dataframe

Tags:

python

json

pandas

seanv507

People also ask

1 Answers

Andy Hayden

Recent Activity

Donate For Us