Python Pandas: Expand a Column of List of Lists into Two New Columns

I have a DF which looks like this.

name    id  apps
john    1   [[app1, v1], [app2, v2], [app3,v3]]
smith   2   [[app1, v1], [app4, v4]]

I want to expand the apps column such that it looks like this.

name    id  app_name    app_version
john    1   app1        v1
john    1   app2        v2
john    1   app3        v3
smith   2   app1        v1
smith   2   app4        v4

Any help is appreciated

Imsa asked May 11 '19 23:05


2 Answers

You can .apply(pd.Series) twice to get what you need as an intermediate step, then merge back to the original dataframe.

import pandas as pd

df = pd.DataFrame({
    'name': ['john', 'smith'],
    'id': [1, 2],
    'apps': [[['app1', 'v1'], ['app2', 'v2'], ['app3','v3']], 
             [['app1', 'v1'], ['app4', 'v4']]]
})

dftmp = df.apps.apply(pd.Series).T.melt().dropna()
dfapp = (dftmp.value
              .apply(pd.Series)
              .set_index(dftmp.variable)
              .rename(columns={0:'app_name', 1:'app_version'})
        )

df[['name', 'id']].merge(dfapp, left_index=True, right_index=True)
# returns:
    name  id app_name app_version
0   john   1     app1          v1
0   john   1     app2          v2
0   john   1     app3          v3
1  smith   2     app1          v1
1  smith   2     app4          v4
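The intermediate frame in this approach is easy to lose track of, so here is a minimal sketch that inspects it step by step. It reuses the sample data from the question; the name `wide` is introduced only for illustration and is not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['john', 'smith'],
    'id': [1, 2],
    'apps': [[['app1', 'v1'], ['app2', 'v2'], ['app3', 'v3']],
             [['app1', 'v1'], ['app4', 'v4']]],
})

# Step 1: one column per list position; the shorter list is padded with NaN.
wide = df.apps.apply(pd.Series)

# Step 2: transpose so each original row becomes a column, then melt.
# 'variable' holds the original row label, 'value' holds one [name, version] pair.
dftmp = wide.T.melt().dropna()

print(wide.shape)               # (2, 3)
print(dftmp.columns.tolist())   # ['variable', 'value']
print(dftmp.variable.tolist())  # [0, 0, 0, 1, 1]
```

Because `variable` carries the original row labels, setting it as the index is what lets the final merge line each pair up with its `name` and `id`.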
James answered Nov 13 '22 06:11


A chain of pd.Series calls is easy to understand. If you would like to know more methods, check unnesting.

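# dropna() is explicit because newer pandas versions may keep the NaN padding when stacking
(df.set_index(['name', 'id']).apps.apply(pd.Series)
   .stack().dropna().apply(pd.Series)
   .reset_index(level=[0, 1])
   .rename(columns={0: 'app_name', 1: 'app_version'}))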
Out[541]: 
    name  id app_name app_version
0   john   1     app1          v1
1   john   1     app2          v2
2   john   1     app3          v3
0  smith   2     app1          v1
1  smith   2     app4          v4
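To see why the resulting index restarts at 0 for each person, the chain can be split into named steps. This is only a sketch on the question's sample data; `wide`, `pairs` and `result` are illustrative names, and the explicit `dropna()` is an assumption added so the sketch does not depend on how a given pandas version's `stack()` treats the NaN padding.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['john', 'smith'],
    'id': [1, 2],
    'apps': [[['app1', 'v1'], ['app2', 'v2'], ['app3', 'v3']],
             [['app1', 'v1'], ['app4', 'v4']]],
})

# Keep name and id in the index so they survive the reshaping.
wide = df.set_index(['name', 'id']).apps.apply(pd.Series)

# stack() moves the list positions (0, 1, 2) into a third index level,
# giving a Series of [name, version] pairs keyed by (name, id, position).
pairs = wide.stack().dropna()

# Split each pair into two columns, then move name and id back to columns.
result = (pairs.apply(pd.Series)
               .reset_index(level=[0, 1])
               .rename(columns={0: 'app_name', 1: 'app_version'}))

print(pairs.index.nlevels)     # 3
print(result.index.tolist())   # [0, 1, 2, 0, 1]
```

The leftover index is the position of each pair inside its original list, which is why it is not unique across people.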

Method two: slightly modify the function I wrote.

def unnesting(df, explode):
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([
        pd.DataFrame({x: sum(df[x].tolist(),[])}) for x in explode], axis=1)
    df1.index = idx
    # pandas 2.0 and later accept the axis only by keyword, so name the columns explicitly
    return df1.join(df.drop(columns=explode), how='left')

Then

yourdf=unnesting(df,['apps'])

yourdf['app_name'],yourdf['app_version']=yourdf.apps.str[0],yourdf.apps.str[1]
yourdf
Out[548]: 
         apps  id   name app_name app_version
0  [app1, v1]   1   john     app1          v1
0  [app2, v2]   1   john     app2          v2
0  [app3, v3]   1   john     app3          v3
1  [app1, v1]   2  smith     app1          v1
1  [app4, v4]   2  smith     app4          v4

Or

yourdf=unnesting(df,['apps']).reindex(columns=df.columns.tolist()+['app_name','app_version'])
yourdf[['app_name','app_version']]=yourdf.apps.tolist()
yourdf
Out[567]: 
         apps  id   name app_name app_version
0  [app1, v1]   1   john     app1          v1
0  [app2, v2]   1   john     app2          v2
0  [app3, v3]   1   john     app3          v3
1  [app1, v1]   2  smith     app1          v1
1  [app4, v4]   2  smith     app4          v4
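On pandas 0.25 and later, the built-in `DataFrame.explode` does the same row repetition as the hand-written unnesting helper. The following is a hedged sketch of that alternative on the question's data, not something either answer proposed; `long`, `pairs` and `result` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['john', 'smith'],
    'id': [1, 2],
    'apps': [[['app1', 'v1'], ['app2', 'v2'], ['app3', 'v3']],
             [['app1', 'v1'], ['app4', 'v4']]],
})

# explode() emits one row per inner [name, version] pair; resetting the
# index afterwards gives every output row a unique label.
long = df.explode('apps').reset_index(drop=True)

# Pull the two halves of each pair out with plain list comprehensions.
pairs = long['apps'].tolist()
result = long.drop(columns='apps').assign(
    app_name=[p[0] for p in pairs],
    app_version=[p[1] for p in pairs],
)

print(result.columns.tolist())  # ['name', 'id', 'app_name', 'app_version']
print(len(result))              # 5
```

Only the outer list is exploded, so each cell of `apps` still holds one two-element pair before it is split.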
BENY answered Nov 13 '22 06:11