Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python Pandas: Split slash separated strings in two or more columns into multiple rows

I have a pandas dataframe that looks like this:

SUBJECT                 STUDENT         CITY        STATE

Math/Chemistry/Biology  Sam/Peter/Mary  Los Angeles CA
Geology/Physics         John            Boston      MA

This is how it should look like:

SUBJECT      STUDENT    CITY           STATE

Math         Sam        Los Angeles    CA
Chemistry    Peter      Los Angeles    CA
Biology      Mary       Los Angeles    CA
Geology      John       Boston         MA
Physics      John       Boston         MA

Before asking this question, I referred to the solutions mentioned in this page: pandas: How do I split text in a column into multiple rows?

Since there are slash separated strings in two columns, I am not able to use the solutions in the above link.

like image 859
KeyboardWarrior Avatar asked Apr 09 '26 08:04

KeyboardWarrior


2 Answers

Another solution with concat and join:

s1 = df.SUBJECT.str.split('/', expand=True).stack()
s2 = df.STUDENT.str.split('/', expand=True).stack()
print (s1)
0  0         Math
   1    Chemistry
   2      Biology
1  0      Geology
   1      Physics

print (s2)
0  0      Sam
   1    Peter
   2     Mary
1  0     John
dtype: object
df1 = pd.concat([s1,s2], axis=1, keys=('SUBJECT','STUDENT'))
        .ffill()
        .reset_index(level=1, drop=True)
print (df1)
     SUBJECT STUDENT
0       Math     Sam
0  Chemistry   Peter
0    Biology    Mary
1    Geology    John
1    Physics    John

df = df.drop(['SUBJECT','STUDENT'], axis=1)
       .join(df1)
       .reset_index(drop=True)[['SUBJECT', 'STUDENT', 'CITY','STATE']]
print (df)
     SUBJECT STUDENT         CITY STATE
0       Math     Sam  Los Angeles    CA
1  Chemistry   Peter  Los Angeles    CA
2    Biology    Mary  Los Angeles    CA
3    Geology    John       Boston    MA
4    Physics    John       Boston    MA
like image 58
jezrael Avatar answered Apr 10 '26 21:04

jezrael


First thing, split fields by '/'

df.SUBJECT = df.SUBJECT.str.split('/')
df.STUDENT = df.STUDENT.str.split('/')

Then I use a function to explode rows. However, I had to segregate those rows that only had one student or subject.

def explode(df, columns):
    idx = np.repeat(df.index, df[columns[0]].str.len())
    a = df.T.reindex_axis(columns).values
    concat = np.concatenate([np.concatenate(a[i]) for i in range(a.shape[0])])
    p = pd.DataFrame(concat.reshape(a.shape[0], -1).T, idx, columns)
    return pd.concat([df.drop(columns, axis=1), p], axis=1).reset_index(drop=True)


cond = df.STUDENT.str.len() == df.SUBJECT.str.len()

df_paired = df[cond]
df_unpard = df[~cond]

if not df_paired.empty:
    df_paired = explode(df_paired, ['STUDENT','SUBJECT'])

if not df_unpard.empty:
    df_unpard = explode(explode(df_unpard, ['STUDENT']), ['SUBJECT'])

Finally

pd.concat([df_paired, df_unpard], ignore_index=True)[df.columns]

enter image description here


Timing

piRSquared

%%timeit

df = df_.copy()

df.SUBJECT = df.SUBJECT.str.split('/')
df.STUDENT = df.STUDENT.str.split('/')

def explode(df, columns):
    idx = np.repeat(df.index, df[columns[0]].str.len())
    a = df.T.reindex_axis(columns).values
    concat = np.concatenate([np.concatenate(a[i]) for i in range(a.shape[0])])
    p = pd.DataFrame(concat.reshape(a.shape[0], -1).T, idx, columns)
    return pd.concat([df.drop(columns, axis=1), p], axis=1).reset_index(drop=True)

cond = df.STUDENT.str.len() == df.SUBJECT.str.len()

df_paired = df[cond]
df_unpard = df[~cond]

if not df_paired.empty:
    df_paired = explode(df_paired, ['STUDENT','SUBJECT'])

if not df_unpard.empty:
    df_unpard = explode(explode(df_unpard, ['STUDENT']), ['SUBJECT'])

pd.concat([df_paired, df_unpard], ignore_index=True)[df.columns]

100 loops, best of 3: 7.76 ms per loop

jezrael

%%timeit

df = df_.copy()

s1 = df.SUBJECT.str.split('/', expand=True).stack()
s2 = df.STUDENT.str.split('/', expand=True).stack()

df1 = pd.concat([s1,s2], axis=1, keys=('SUBJECT','STUDENT')) \
        .ffill() \
        .reset_index(level=1, drop=True)

df.drop(['SUBJECT','STUDENT'], axis=1) \
  .join(df1) \
  .reset_index(drop=True)[['SUBJECT', 'STUDENT', 'CITY','STATE']]

100 loops, best of 3: 5.13 ms per loop
like image 28
piRSquared Avatar answered Apr 10 '26 20:04

piRSquared



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!