It was a bit tricky to explain the problem. I want to split a cell containing multiple string values delimited by commas into different rows. The df below is a small example but the real dataset contains up to 15 columns and 15 rows and each cell has 5 to 6 non unique string values in it that I need to separate into different rows.
How can we split the original df to the transformed df?
Original df
import pandas as pd
df = pd.DataFrame({"Privileges":['Type1','Type2','Type3'],"Super_Admin":["A1,A2,A3,A4","A1,B1,B2,B3, A4","C1,B2,C2,C3"], "Admin":["A1,A2","A1,B1,B2","B2, C1,C2"])
Index | Privileges | Super_Admin | Admin |
---|---|---|---|
0 | Type1 | A1,A2,A3,A4 | A1,A2 |
1 | Type2 | A1,B1,B2,B3, A4 | A1,B1,B2 |
2 | Type3 | C1,B2,C2,C3 | B2, C1,C2 |
Transformed df
df = pd.DataFrame({"Privileges":['Type1','Type1','Type1','Type1','Type2','Type2','Type2','Type2','Type2','Type3','Type3','Type3','Type3'],"Super_Admin":["A1","A2","A3","A4","A1", "B1","B2","B3", "A4","C1","B2","C2","C3"], "Admin":["A1","A2",'', '',"A1","B1","B2",'', '', "B2", "C1","C2", '']})
Index | Privileges | Super_Admin | Admin |
---|---|---|---|
0 | Type1 | A1 | A1 |
1 | Type1 | A2 | A2 |
2 | Type1 | A3 | -- |
3 | Type1 | A4 | -- |
4 | Type2 | A1 | A1 |
5 | Type2 | B1 | B1 |
6 | Type2 | B2 | B2 |
7 | Type2 | B3 | -- |
8 | Type2 | A4 | -- |
9 | Type3 | C1 | B2 |
10 | Type3 | B2 | C1 |
11 | Type3 | C2 | C2 |
12 | Type3 | C3 | -- |
Breakdown of the steps below:
zip all the columns
Use zip_longest to pair None
with values that do not have a pair
Combine the two lists into one, with chain
Create dataframe and forward fill on the Privileges
column
In [541]: step1 = zip(df.Privileges, df.Super_Admin, df.Admin)
In [542]: step2 = (zip_longest([first], second,last)
for first, second, last
in step1)
In [543]: step3 = chain.from_iterable(step2)
In [546]: (pd.DataFrame(step3, columns = df.columns)
.assign(Privileges = lambda df: df.Privileges.ffill())
)
Out[546]:
Privileges Super_Admin Admin
0 Type1 A1 A1
1 Type1 A2 A2
2 Type1 A3 None
3 Type1 A4 None
4 Type2 A1 A1
5 Type2 B1 B1
6 Type2 B2 B2
7 Type2 B3 None
8 Type2 A4 None
9 Type3 C1 B2
10 Type3 B2 C1
11 Type3 C2 C2
12 Type3 C3 None
To get more speed, you can move the split step into native python territory. pandas string methods are wrappers around python's string functions, as such they are not as fast as python's string functions.
This is on option.
First make Super_Admin
and Admin
to a list. This is useful to use pd.explode()
.
df['Super_Admin'] = df['Super_Admin'].apply(lambda x: x.split(','))
df['Admin'] = df['Admin'].apply(lambda x: x.split(','))
Then call explode
on both columns and fill the missing values with an empty string.
a = df.explode('Super_Admin')
b = df.explode('Admin')
for i in range(3):
short = b.loc[i,'Admin'].values
long = a.loc[i,'Admin'].values
a.loc[i,'Admin'] = np.concatenate((short, ['']*(len(long)-len(short))), axis=0)
The output looks like this:
>>> a
Privileges Super_Admin Admin
0 Type1 A1 A1
0 Type1 A2 A2
0 Type1 A3
0 Type1 A4
1 Type2 A1 A1
1 Type2 B1 B1
1 Type2 B2 B2
1 Type2 B3
1 Type2 A4
2 Type3 C1 B2
2 Type3 B2 C1
2 Type3 C2 C2
2 Type3 C3
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With