Pandas, normalising json-per-line

Tags:

pandas

Pandas has the pandas.io.json.json_normalize function (exposed as pandas.json_normalize since pandas 1.0) that can flatten JSON.

I've got a source file that contains JSON-per-line data (streamed to the file by a long-running process). I'm not really in a position to modify what is written to that file. Here's a contrived example of the JSON:

{"type": "bar", "aspect": {"Positive": 1, "Negative": 0.6}}
{"type": "bar", "aspect": {"Positive": 0.6, "Negative": 1.5}}

I can read it in using the normal pandas.read_json method by passing the lines=True parameter. However, I'd like it to be flattened, as if by json_normalize, because that gets it into a really useful form, e.g.

>>> json_normalize(json.loads('{"type": "bar", "aspect": {"Positive": 1, "Negative": 0.6}}'))
   aspect.Negative  aspect.Positive type
0              0.6                1  bar

If I loop through the source, normalize and append, that's going to result in a full copy for each line I add. That's going to really hurt performance.

asked Sep 29 '17 04:09

Twirrim

1 Answer

You can use read_json + DataFrame constructor + add_prefix + drop + join:

import pandas as pd

# Each line of the file is parsed as one JSON record
df = pd.read_json('file.json', lines=True)
print (df)
                                              aspect type
0    {'Negative': 0.6000000000000001, 'Positive': 1}  bar
1  {'Negative': 1.5, 'Positive': 0.6000000000000001}  bar

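# Expand the dicts in 'aspect' into columns, prefix them, then re-attach the remaining columns
df = (pd.DataFrame(df['aspect'].values.tolist())
        .add_prefix('aspect.')
        .join(df.drop(columns='aspect')))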
print (df)
   aspect.Negative  aspect.Positive type
0              0.6              1.0  bar
1              1.5              0.6  bar
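The same steps can be wrapped in a small reusable function. This is only a sketch: flatten_column is a hypothetical helper name, the sample data is the two lines from the question held in memory rather than read from file.json, and it assumes every row of the column holds a dict (missing values would need handling first).

```python
import io

import pandas as pd


def flatten_column(df, column, sep="."):
    """Expand a column of dicts into prefixed columns (hypothetical helper)."""
    # Build one frame from all the dicts in a single pass, reusing the original index
    expanded = pd.DataFrame(df[column].tolist(), index=df.index)
    expanded = expanded.add_prefix(f"{column}{sep}")
    # drop(columns=...) avoids relying on a positional axis argument
    return expanded.join(df.drop(columns=column))


# In-memory stand-in for the JSON-per-line file from the question
sample = io.StringIO(
    '{"type": "bar", "aspect": {"Positive": 1, "Negative": 0.6}}\n'
    '{"type": "bar", "aspect": {"Positive": 0.6, "Negative": 1.5}}\n'
)
flat = flatten_column(pd.read_json(sample, lines=True), "aspect")
print(flat)
```

Passing index=df.index keeps the join aligned even if the frame does not have the default 0..n-1 index.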

Alternatively, call json.loads on each line and then pass the result to json_normalize:

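import json
from pandas import json_normalize  # pandas >= 1.0; older versions: from pandas.io.json import json_normalize

df = json_normalize(pd.Series(open('file.json').readlines()).apply(json.loads))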
print (df)
   aspect.Negative  aspect.Positive type
0              0.6              1.0  bar
1              1.5              0.6  bar

Or, with a list comprehension instead of a Series:

df = json_normalize([json.loads(x) for x in open('file.json').readlines()])
print (df)

   aspect.Negative  aspect.Positive type
0              0.6              1.0  bar
1              1.5              0.6  bar
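For what it's worth, the json.loads variant can also be written so the file handle is closed explicitly and blank lines are skipped. A rough sketch, assuming pandas 1.0 or later for pd.json_normalize; read_json_lines_flat is a hypothetical helper name, and the temporary file only stands in for the real source file.

```python
import json
import os
import tempfile

import pandas as pd


def read_json_lines_flat(path):
    """Read a JSON-per-line file and flatten nested objects (hypothetical helper)."""
    with open(path, encoding="utf-8") as handle:
        # Parse each non-blank line; the file is closed when the block exits
        records = [json.loads(line) for line in handle if line.strip()]
    # A single normalisation call over the whole list, so no per-line DataFrame copies
    return pd.json_normalize(records)


# Write the two example lines from the question to a temporary file
with tempfile.NamedTemporaryFile(
    "w", suffix=".json", delete=False, encoding="utf-8"
) as tmp:
    tmp.write('{"type": "bar", "aspect": {"Positive": 1, "Negative": 0.6}}\n')
    tmp.write('{"type": "bar", "aspect": {"Positive": 0.6, "Negative": 1.5}}\n')

flat_df = read_json_lines_flat(tmp.name)
os.remove(tmp.name)
print(flat_df)
```

Because all lines are parsed into a plain list first and normalised once, this avoids the append-per-line copying the question was worried about.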

Timings on 5k rows:

In [13]: %timeit json_normalize([json.loads(x) for x in open('file.json').readlines()])
10 loops, best of 3: 112 ms per loop

In [14]: %timeit json_normalize(pd.Series(open('file.json').readlines()).apply(json.loads))
10 loops, best of 3: 117 ms per loop

In [15]: %%timeit
    ...: df = pd.read_json('file.json', lines = True)
    ...: df = (pd.DataFrame(df['aspect'].values.tolist()).add_prefix('aspect.').join(df.drop(columns='aspect')))
    ...: 
10 loops, best of 3: 30.1 ms per loop
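To repeat the comparison outside IPython, something like the following could be used. It is a sketch only: the 5k-row file is synthetic (the original test file is not available), the function names are made up for illustration, and the numbers it prints will depend on the machine and pandas version, so they need not match the figures above.

```python
import json
import os
import tempfile
import timeit

import pandas as pd

# Synthetic 5k-row JSON-per-line file shaped like the example in the question
rows = [
    {"type": "bar", "aspect": {"Positive": i % 7, "Negative": 0.5 * i}}
    for i in range(5000)
]
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    tmp.write("\n".join(json.dumps(row) for row in rows) + "\n")
path = tmp.name


def via_json_normalize():
    # Parse every line with json.loads, then normalise the list once
    with open(path) as handle:
        return pd.json_normalize([json.loads(line) for line in handle])


def via_read_json():
    # Let pandas parse the lines, then expand the nested column
    df = pd.read_json(path, lines=True)
    return (pd.DataFrame(df["aspect"].tolist())
              .add_prefix("aspect.")
              .join(df.drop(columns="aspect")))


result_a = via_json_normalize()
result_b = via_read_json()

for func in (via_json_normalize, via_read_json):
    # Best of 3 repeats, 10 calls each, reported per call
    best = min(timeit.repeat(func, number=10, repeat=3)) / 10
    print(f"{func.__name__}: {best * 1000:.1f} ms per call")

os.remove(path)
```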

answered Oct 02 '22 22:10

jezrael
