Slice pandas dataframe json column into columns

Tags: python, json, python-3.x, pandas

I have a data frame with two columns that hold JSON-like dictionaries, like this:

author           biblio                                      series
Mehrdad Vahabi   {'volume': 68, 'month': 'January',          {'handle': 'RePEc:aka:aoecon', 'name': 'Oeconomica'}
                  'name': 'János Kornai', 'issue': 's',
                  'handle': 'n:v:68:y:2018:i',
                  'year': '2018', 'pages': '27-52',
                  'doi': ''}
Michael Bailey   {'c_date': '2017', 'number': '23608',       {'handle': '', 'name': ''}
                  'handle': 'RePEc:nbr:nberwo:23608',
                  'name': 'Measuring'}

I want my data frame to look like this:

author          biblio.volume  biblio.month  biblio.name   biblio.issue  biblio.handle           biblio.year  biblio.pages  biblio.doi  biblio.c_date  biblio.number  series.handle     series.name
Mehrdad Vahabi  68             January       János Kornai  s             n:v:68:y:2018:i         2018         27-52                     NA             NA             RePEc:aka:aoecon  Oeconomica
Michael Bailey  NA             NA            Measuring     NA            RePEc:nbr:nberwo:23608  NA           NA            NA          2017           23608

I tried the answers to this question, but none of them works for me.

How can I do it?

[EDIT] Here is a sample of the data

[EDIT]

Following the @jezrael solution I get this:

df1 = pd.DataFrame(df['biblio'].values.tolist())
df1.columns = 'biblio.'+ df1.columns

df2 = pd.DataFrame(df['series'].values.tolist())
df2.columns = 'series.'+ df2.columns

col = df.columns.difference(['biblio','series'])
df = pd.concat([df[col], df1, df2],axis=1)
print (df)


Traceback (most recent call last):
  File "dfs.py", line 8, in <module>
    df1.columns = 'bibliographic.'+ df1.columns
  File "/Users/danielotero/anaconda3/lib/python3.6/site-packages/pandas/core/indexes/range.py", line 583, in _evaluate_numeric_binop
    other = self._validate_for_numeric_binop(other, op, opstr)
  File "/Users/danielotero/anaconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3961, in _validate_for_numeric_binop
    raise TypeError("can only perform ops with scalar values")
TypeError: can only perform ops with scalar values

And with json_normalize:

Traceback (most recent call last):
  File "/Users/danielotero/anaconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2525, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 117, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1265, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1273, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "dfs.py", line 7, in <module>
    df = json_normalize(d)
  File "/Users/danielotero/anaconda3/lib/python3.6/site-packages/pandas/io/json/normalize.py", line 192, in json_normalize
    if any([isinstance(x, dict) for x in compat.itervalues(data[0])]):
  File "/Users/danielotero/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2139, in __getitem__
    return self._getitem_column(key)
  File "/Users/danielotero/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2146, in _getitem_column
    return self._get_item_cache(key)
  File "/Users/danielotero/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py", line 1842, in _get_item_cache
    values = self._data.get(item)
  File "/Users/danielotero/anaconda3/lib/python3.6/site-packages/pandas/core/internals.py", line 3843, in get
    loc = self.items.get_loc(item)
  File "/Users/danielotero/anaconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2527, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 117, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1265, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1273, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 0

Following the @Jhon H solution, I get this:

Traceback (most recent call last):
  File "dfs.py", line 7, in <module>
    jsonSeries = df[['bibliographic']].tolist()
  File "/Users/danielotero/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py", line 3614, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'tolist'

asked Mar 03 '18 05:03

Moses

2 Answers

Create a new DataFrame from each dict column with the DataFrame constructor, then concat them all together:

df1 = pd.DataFrame(df['biblio'].values.tolist())
df1.columns = 'biblio.'+ df1.columns

df2 = pd.DataFrame(df['series'].values.tolist())
df2.columns = 'series.'+ df2.columns

col = df.columns.difference(['biblio','series'])
df = pd.concat([df[col], df1, df2],axis=1)
print (df)
           author biblio.c_date biblio.doi           biblio.handle  \
0  Mehrdad Vahabi           NaN                    n:v:68:y:2018:i   
1  Michael Bailey          2017        NaN  RePEc:nbr:nberwo:23608   

  biblio.issue biblio.month   biblio.name biblio.number biblio.pages  \
0            s      January  Janos Kornai           NaN        27-52   
1          NaN          NaN     Measuring         23608          NaN   

   biblio.volume biblio.year     series.handle series.name  
0           68.0        2018  RePEc:aka:aoecon  Oeconomica  
1            NaN         NaN
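
The same steps can be wrapped in a small helper when there are several dict columns. This is only a sketch: the function name expand_dict_columns is made up for illustration, the sample frame is a shortened version of the data in the question, and it assumes every cell in the listed columns is already a dict (not a string).

```python
import pandas as pd

def expand_dict_columns(df, columns):
    """Return a new frame with each dict-valued column expanded into prefixed columns."""
    # Start with the columns that are left untouched.
    parts = [df.drop(columns=columns)]
    for name in columns:
        # One frame per dict column; keys missing from a row become NaN.
        expanded = pd.DataFrame(df[name].tolist(), index=df.index)
        parts.append(expanded.add_prefix(name + "."))
    return pd.concat(parts, axis=1)

# Shortened sample based on the question's data.
df = pd.DataFrame({
    "author": ["Mehrdad Vahabi", "Michael Bailey"],
    "biblio": [
        {"volume": 68, "month": "January", "name": "Janos Kornai"},
        {"c_date": "2017", "number": "23608", "name": "Measuring"},
    ],
    "series": [
        {"handle": "RePEc:aka:aoecon", "name": "Oeconomica"},
        {"handle": "", "name": ""},
    ],
})

flat = expand_dict_columns(df, ["biblio", "series"])
print(flat.columns.tolist())
```

Passing index=df.index keeps the expanded frames aligned with the original rows, which matters for concat if the frame was filtered or re-indexed earlier.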

EDIT:

If the input is JSON (a list of nested dicts), it is possible to use json_normalize:

from pandas.io.json import json_normalize

d = [{"author":"Mehrdad Vahabi","biblio":{"volume":68,"month":"January","name":"Janos Kornai","issue":"s","handle":"n:v:68:y:2018:i","year":"2018","pages":"27-52","doi":""},"series":{"handle":"RePEc:aka:aoecon","name":"Oeconomica"}},{"author":"Michael Bailey","biblio":{"c_date":"2017","number":"23608","handle":"RePEc:nbr:nberwo:23608","name":"Measuring"},"series":{"handle":"","name":""}}]

df = json_normalize(d)
print (df)
           author biblio.c_date biblio.doi           biblio.handle  \
0  Mehrdad Vahabi           NaN                    n:v:68:y:2018:i   
1  Michael Bailey          2017        NaN  RePEc:nbr:nberwo:23608   

  biblio.issue biblio.month   biblio.name biblio.number biblio.pages  \
0            s      January  Janos Kornai           NaN        27-52   
1          NaN          NaN     Measuring         23608          NaN   

   biblio.volume biblio.year     series.handle series.name  
0           68.0        2018  RePEc:aka:aoecon  Oeconomica  
1            NaN         NaN
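
In pandas 1.0 and later the same function is exposed at the top level as pd.json_normalize. A minimal sketch with a shortened version of the records above follows. Note that it expects a dict or a list of dicts. The KeyError: 0 traceback in the question goes through DataFrame.__getitem__, which suggests a DataFrame was passed instead.

```python
import pandas as pd

# Shortened records; the nested dicts are real dicts, not strings.
records = [
    {"author": "Mehrdad Vahabi",
     "biblio": {"volume": 68, "month": "January"},
     "series": {"handle": "RePEc:aka:aoecon", "name": "Oeconomica"}},
    {"author": "Michael Bailey",
     "biblio": {"c_date": "2017", "number": "23608"},
     "series": {"handle": "", "name": ""}},
]

# sep is the joiner between parent and child keys; "." is the default.
flat = pd.json_normalize(records, sep=".")
print(flat.columns.tolist())
```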

EDIT: The problem is that your dictionaries are stored as strings, so you first need to convert them with ast.literal_eval:

import ast
import pandas as pd

df = pd.read_csv('probe.csv')
#print (df)

df1 = pd.DataFrame(df['bibliographic'].apply(ast.literal_eval).values.tolist())
df1.columns = 'bibliographic.'+ df1.columns

df2 = pd.DataFrame(df['series'].apply(ast.literal_eval).values.tolist())
df2.columns = 'series.'+ df2.columns

col = df.columns.difference(['bibliographic','series'])
df = pd.concat([df[col], df1, df2],axis=1)
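
To see why the conversion is needed, here is a self-contained sketch. The CSV text is a made-up stand-in for probe.csv (which is not shown in the question), and parse_cell is a hypothetical helper name. After read_csv each cell is a plain string, so pd.DataFrame(...) on those values has no dict keys to use as column names. That would explain the TypeError raised from range.py in the question.

```python
import ast
import io
import pandas as pd

# Hypothetical stand-in for probe.csv: the dict cells are stored as quoted text.
csv_text = '''author,bibliographic,series
Mehrdad Vahabi,"{'volume': 68, 'name': 'Janos Kornai'}","{'handle': 'RePEc:aka:aoecon', 'name': 'Oeconomica'}"
Michael Bailey,"{'c_date': '2017', 'name': 'Measuring'}","{'handle': '', 'name': ''}"
'''
df = pd.read_csv(io.StringIO(csv_text))

# After read_csv the cells are strings, not dicts.
print(type(df.loc[0, "bibliographic"]))

def parse_cell(value):
    """Parse a Python-literal string into a dict; pass non-strings through unchanged."""
    return ast.literal_eval(value) if isinstance(value, str) else value

# Convert every cell, then expand the dicts into prefixed columns.
parsed = [parse_cell(v) for v in df["bibliographic"]]
expanded = pd.DataFrame(parsed, index=df.index).add_prefix("bibliographic.")
print(expanded)
```

ast.literal_eval only evaluates Python literals, so it is a safer choice than eval for text read from a file. If the file held strict JSON (double-quoted keys), json.loads would be the usual alternative.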

answered Sep 23 '22 01:09

jezrael

You need to process the dict columns individually and then join the results to get the format you need. Here is a simple example that you could follow:

import pandas as pd
records = [{'col1':'v1','col2':{'a1':1,'b1':1},'col3':{'c1':1,'d1':1}},
           {'col1':'v2','col2':{'a1':2,'b1':2},'col3':{'c1':2,'d1':2}}]
sample_df = pd.DataFrame(records)
sample_df


    col1      col2                col3
0   v1  {'a1': 1, 'b1': 1}  {'c1': 1, 'd1': 1}
1   v2  {'a1': 2, 'b1': 2}  {'c1': 2, 'd1': 2}


col2_expanded = sample_df.col2.apply(lambda x:pd.Series(x))
col2_expanded.columns = ['{}.{}'.format('col2',i) for i in col2_expanded]
col2_expanded


    col2.a1     col2.b1
0   1             1
1   2             2

col3_expanded = sample_df.col3.apply(lambda x:pd.Series(x))
col3_expanded.columns = ['{}.{}'.format('col3',i) for i in col3_expanded]
col3_expanded


    col3.c1     col3.d1
0   1            1
1   2            2

final = pd.concat([sample_df[['col1']],col2_expanded,col3_expanded],axis=1)
final

    col1    col2.a1     col2.b1     col3.c1     col3.d1
0   v1       1           1            1         1
1   v2       2           2            2         2
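
As a possible shortcut for the same sample_df, the rows can be turned back into nested dicts and flattened in one call. This is a sketch that assumes pandas 1.0 or later (for pd.json_normalize) and that the cells are real dicts. It is not part of the original answer.

```python
import pandas as pd

records = [{'col1': 'v1', 'col2': {'a1': 1, 'b1': 1}, 'col3': {'c1': 1, 'd1': 1}},
           {'col1': 'v2', 'col2': {'a1': 2, 'b1': 2}, 'col3': {'c1': 2, 'd1': 2}}]
sample_df = pd.DataFrame(records)

# Convert each row to a nested dict, then flatten all nested keys at once.
final = pd.json_normalize(sample_df.to_dict(orient="records"))
print(final)
```

For large frames the to_dict round trip copies the data, so the column-by-column approach above may be preferable there.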

answered Sep 22 '22 01:09

vumaasha
