I have a pandas dataframe that looks like this:
SUBJECT STUDENT CITY STATE
Math/Chemistry/Biology Sam/Peter/Mary Los Angeles CA
Geology/Physics John Boston MA
This is how it should look like:
SUBJECT STUDENT CITY STATE
Math Sam Los Angeles CA
Chemistry Peter Los Angeles CA
Biology Mary Los Angeles CA
Geology John Boston MA
Physics John Boston MA
Before asking this question, I referred to the solutions mentioned in this page: pandas: How do I split text in a column into multiple rows?
Since there are slash separated strings in two columns, I am not able to use the solutions in the above link.
Another solution with concat and join:
s1 = df.SUBJECT.str.split('/', expand=True).stack()
s2 = df.STUDENT.str.split('/', expand=True).stack()
print (s1)
0 0 Math
1 Chemistry
2 Biology
1 0 Geology
1 Physics
print (s2)
0 0 Sam
1 Peter
2 Mary
1 0 John
dtype: object
df1 = pd.concat([s1,s2], axis=1, keys=('SUBJECT','STUDENT'))
.ffill()
.reset_index(level=1, drop=True)
print (df1)
SUBJECT STUDENT
0 Math Sam
0 Chemistry Peter
0 Biology Mary
1 Geology John
1 Physics John
df = df.drop(['SUBJECT','STUDENT'], axis=1)
.join(df1)
.reset_index(drop=True)[['SUBJECT', 'STUDENT', 'CITY','STATE']]
print (df)
SUBJECT STUDENT CITY STATE
0 Math Sam Los Angeles CA
1 Chemistry Peter Los Angeles CA
2 Biology Mary Los Angeles CA
3 Geology John Boston MA
4 Physics John Boston MA
First thing, split fields by '/'
df.SUBJECT = df.SUBJECT.str.split('/')
df.STUDENT = df.STUDENT.str.split('/')
Then I use a function to explode rows. However, I had to segregate those rows that only had one student or subject.
def explode(df, columns):
idx = np.repeat(df.index, df[columns[0]].str.len())
a = df.T.reindex_axis(columns).values
concat = np.concatenate([np.concatenate(a[i]) for i in range(a.shape[0])])
p = pd.DataFrame(concat.reshape(a.shape[0], -1).T, idx, columns)
return pd.concat([df.drop(columns, axis=1), p], axis=1).reset_index(drop=True)
cond = df.STUDENT.str.len() == df.SUBJECT.str.len()
df_paired = df[cond]
df_unpard = df[~cond]
if not df_paired.empty:
df_paired = explode(df_paired, ['STUDENT','SUBJECT'])
if not df_unpard.empty:
df_unpard = explode(explode(df_unpard, ['STUDENT']), ['SUBJECT'])
Finally
pd.concat([df_paired, df_unpard], ignore_index=True)[df.columns]

piRSquared
%%timeit
df = df_.copy()
df.SUBJECT = df.SUBJECT.str.split('/')
df.STUDENT = df.STUDENT.str.split('/')
def explode(df, columns):
idx = np.repeat(df.index, df[columns[0]].str.len())
a = df.T.reindex_axis(columns).values
concat = np.concatenate([np.concatenate(a[i]) for i in range(a.shape[0])])
p = pd.DataFrame(concat.reshape(a.shape[0], -1).T, idx, columns)
return pd.concat([df.drop(columns, axis=1), p], axis=1).reset_index(drop=True)
cond = df.STUDENT.str.len() == df.SUBJECT.str.len()
df_paired = df[cond]
df_unpard = df[~cond]
if not df_paired.empty:
df_paired = explode(df_paired, ['STUDENT','SUBJECT'])
if not df_unpard.empty:
df_unpard = explode(explode(df_unpard, ['STUDENT']), ['SUBJECT'])
pd.concat([df_paired, df_unpard], ignore_index=True)[df.columns]
100 loops, best of 3: 7.76 ms per loop
jezrael
%%timeit
df = df_.copy()
s1 = df.SUBJECT.str.split('/', expand=True).stack()
s2 = df.STUDENT.str.split('/', expand=True).stack()
df1 = pd.concat([s1,s2], axis=1, keys=('SUBJECT','STUDENT')) \
.ffill() \
.reset_index(level=1, drop=True)
df.drop(['SUBJECT','STUDENT'], axis=1) \
.join(df1) \
.reset_index(drop=True)[['SUBJECT', 'STUDENT', 'CITY','STATE']]
100 loops, best of 3: 5.13 ms per loop
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With