Python Pandas: Split slash separated strings in two or more columns into multiple rows

Question

I have a pandas dataframe that looks like this:

SUBJECT                 STUDENT         CITY        STATE

Math/Chemistry/Biology  Sam/Peter/Mary  Los Angeles CA
Geology/Physics         John            Boston      MA

This is how it should look like:

SUBJECT      STUDENT    CITY           STATE

Math         Sam        Los Angeles    CA
Chemistry    Peter      Los Angeles    CA
Biology      Mary       Los Angeles    CA
Geology      John       Boston         MA
Physics      John       Boston         MA

Before asking this question, I referred to the solutions mentioned in this page: pandas: How do I split text in a column into multiple rows?

Since there are slash separated strings in two columns, I am not able to use the solutions in the above link.

jezrael · Accepted Answer

Another solution with concat and join:

s1 = df.SUBJECT.str.split('/', expand=True).stack()
s2 = df.STUDENT.str.split('/', expand=True).stack()
print (s1)
0  0         Math
   1    Chemistry
   2      Biology
1  0      Geology
   1      Physics

print (s2)
0  0      Sam
   1    Peter
   2     Mary
1  0     John
dtype: object

df1 = pd.concat([s1,s2], axis=1, keys=('SUBJECT','STUDENT'))
        .ffill()
        .reset_index(level=1, drop=True)
print (df1)
     SUBJECT STUDENT
0       Math     Sam
0  Chemistry   Peter
0    Biology    Mary
1    Geology    John
1    Physics    John

df = df.drop(['SUBJECT','STUDENT'], axis=1)
       .join(df1)
       .reset_index(drop=True)[['SUBJECT', 'STUDENT', 'CITY','STATE']]
print (df)
     SUBJECT STUDENT         CITY STATE
0       Math     Sam  Los Angeles    CA
1  Chemistry   Peter  Los Angeles    CA
2    Biology    Mary  Los Angeles    CA
3    Geology    John       Boston    MA
4    Physics    John       Boston    MA

piRSquared · Answer

First thing, split fields by '/'

df.SUBJECT = df.SUBJECT.str.split('/')
df.STUDENT = df.STUDENT.str.split('/')

Then I use a function to explode rows. However, I had to segregate those rows that only had one student or subject.

def explode(df, columns):
    idx = np.repeat(df.index, df[columns[0]].str.len())
    a = df.T.reindex_axis(columns).values
    concat = np.concatenate([np.concatenate(a[i]) for i in range(a.shape[0])])
    p = pd.DataFrame(concat.reshape(a.shape[0], -1).T, idx, columns)
    return pd.concat([df.drop(columns, axis=1), p], axis=1).reset_index(drop=True)


cond = df.STUDENT.str.len() == df.SUBJECT.str.len()

df_paired = df[cond]
df_unpard = df[~cond]

if not df_paired.empty:
    df_paired = explode(df_paired, ['STUDENT','SUBJECT'])

if not df_unpard.empty:
    df_unpard = explode(explode(df_unpard, ['STUDENT']), ['SUBJECT'])

Finally

pd.concat([df_paired, df_unpard], ignore_index=True)[df.columns]

enter image description here

Timing

piRSquared

%%timeit

df = df_.copy()

df.SUBJECT = df.SUBJECT.str.split('/')
df.STUDENT = df.STUDENT.str.split('/')

def explode(df, columns):
    idx = np.repeat(df.index, df[columns[0]].str.len())
    a = df.T.reindex_axis(columns).values
    concat = np.concatenate([np.concatenate(a[i]) for i in range(a.shape[0])])
    p = pd.DataFrame(concat.reshape(a.shape[0], -1).T, idx, columns)
    return pd.concat([df.drop(columns, axis=1), p], axis=1).reset_index(drop=True)

cond = df.STUDENT.str.len() == df.SUBJECT.str.len()

df_paired = df[cond]
df_unpard = df[~cond]

if not df_paired.empty:
    df_paired = explode(df_paired, ['STUDENT','SUBJECT'])

if not df_unpard.empty:
    df_unpard = explode(explode(df_unpard, ['STUDENT']), ['SUBJECT'])

pd.concat([df_paired, df_unpard], ignore_index=True)[df.columns]

100 loops, best of 3: 7.76 ms per loop

jezrael

%%timeit

df = df_.copy()

s1 = df.SUBJECT.str.split('/', expand=True).stack()
s2 = df.STUDENT.str.split('/', expand=True).stack()

df1 = pd.concat([s1,s2], axis=1, keys=('SUBJECT','STUDENT')) \
        .ffill() \
        .reset_index(level=1, drop=True)

df.drop(['SUBJECT','STUDENT'], axis=1) \
  .join(df1) \
  .reset_index(drop=True)[['SUBJECT', 'STUDENT', 'CITY','STATE']]

100 loops, best of 3: 5.13 ms per loop

Python Pandas: Split slash separated strings in two or more columns into multiple rows

Tags:

python

pandas

dataframe

KeyboardWarrior

2 Answers

jezrael

Timing

piRSquared

Recent Activity

Donate For Us

Python Pandas: Split slash separated strings in two or more columns into multiple rows

Tags:

python

pandas

dataframe

KeyboardWarrior

2 Answers

jezrael

Timing

piRSquared

Related questions

Recent Activity

Donate For Us