Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Splitting a cell in pandas into multiple rows

It was a bit tricky to explain the problem. I want to split a cell containing multiple string values delimited by commas into different rows. The df below is a small example but the real dataset contains up to 15 columns and 15 rows and each cell has 5 to 6 non unique string values in it that I need to separate into different rows.

How can we split the original df to the transformed df?

Original df

import pandas as pd
df = pd.DataFrame({"Privileges":['Type1','Type2','Type3'],"Super_Admin":["A1,A2,A3,A4","A1,B1,B2,B3, A4","C1,B2,C2,C3"], "Admin":["A1,A2","A1,B1,B2","B2, C1,C2"])
Index Privileges Super_Admin Admin
0 Type1 A1,A2,A3,A4 A1,A2
1 Type2 A1,B1,B2,B3, A4 A1,B1,B2
2 Type3 C1,B2,C2,C3 B2, C1,C2

Transformed df

df = pd.DataFrame({"Privileges":['Type1','Type1','Type1','Type1','Type2','Type2','Type2','Type2','Type2','Type3','Type3','Type3','Type3'],"Super_Admin":["A1","A2","A3","A4","A1", "B1","B2","B3", "A4","C1","B2","C2","C3"], "Admin":["A1","A2",'', '',"A1","B1","B2",'', '', "B2", "C1","C2", '']})
Index Privileges Super_Admin Admin
0 Type1 A1 A1
1 Type1 A2 A2
2 Type1 A3 --
3 Type1 A4 --
4 Type2 A1 A1
5 Type2 B1 B1
6 Type2 B2 B2
7 Type2 B3 --
8 Type2 A4 --
9 Type3 C1 B2
10 Type3 B2 C1
11 Type3 C2 C2
12 Type3 C3 --
like image 417
user14318465 Avatar asked Jan 25 '23 10:01

user14318465


2 Answers

Breakdown of the steps below:

zip all the columns

Use zip_longest to pair None with values that do not have a pair

Combine the two lists into one, with chain

Create dataframe and forward fill on the Privileges column

In [541]: step1 = zip(df.Privileges, df.Super_Admin, df.Admin)

In [542]: step2 = (zip_longest([first], second,last) 
                   for first, second, last 
                   in step1)

In [543]: step3 = chain.from_iterable(step2)

In [546]: (pd.DataFrame(step3, columns = df.columns)
             .assign(Privileges = lambda df: df.Privileges.ffill())
           )
Out[546]: 
   Privileges Super_Admin Admin
0       Type1          A1    A1
1       Type1          A2    A2
2       Type1          A3  None
3       Type1          A4  None
4       Type2          A1    A1
5       Type2          B1    B1
6       Type2          B2    B2
7       Type2          B3  None
8       Type2          A4  None
9       Type3          C1    B2
10      Type3          B2    C1
11      Type3          C2    C2
12      Type3          C3  None

To get more speed, you can move the split step into native python territory. pandas string methods are wrappers around python's string functions, as such they are not as fast as python's string functions.

like image 72
sammywemmy Avatar answered Jan 26 '23 23:01

sammywemmy


This is on option. First make Super_Admin and Admin to a list. This is useful to use pd.explode().

df['Super_Admin'] = df['Super_Admin'].apply(lambda x: x.split(','))
df['Admin'] = df['Admin'].apply(lambda x: x.split(','))

Then call explode on both columns and fill the missing values with an empty string.

a = df.explode('Super_Admin')
b = df.explode('Admin')
for i in range(3):
    short = b.loc[i,'Admin'].values
    long = a.loc[i,'Admin'].values
    a.loc[i,'Admin'] = np.concatenate((short, ['']*(len(long)-len(short))), axis=0)

The output looks like this:

>>> a
  Privileges Super_Admin Admin
0      Type1          A1    A1
0      Type1          A2    A2
0      Type1          A3      
0      Type1          A4      
1      Type2          A1    A1
1      Type2          B1    B1
1      Type2          B2    B2
1      Type2          B3      
1      Type2          A4      
2      Type3          C1    B2
2      Type3          B2    C1
2      Type3          C2    C2
2      Type3          C3      
like image 34
mosc9575 Avatar answered Jan 26 '23 22:01

mosc9575