
How to group Pandas data frame by column with regex match

I have the following data frame:

import pandas as pd
df = pd.DataFrame({'id':['a','b','c','d','e'],
                   'XX_111_S5_R12_001_Mobile_05':[-14,-90,-90,-96,-91],
                   'YY_222_S00_R12_001_1-999_13':[-103,0,-110,-114,-114],
                   'ZZ_111_S00_R12_001_1-999_13':[1,2.3,3,5,6],
})

df.set_index('id',inplace=True)
df

Which looks like this:

Out[6]:
    XX_111_S5_R12_001_Mobile_05  YY_222_S00_R12_001_1-999_13  ZZ_111_S00_R12_001_1-999_13
id
a                           -14                         -103                          1.0
b                           -90                            0                          2.3
c                           -90                         -110                          3.0
d                           -96                         -114                          5.0
e                           -91                         -114                          6.0

What I want to do is group the columns based on the following regex:

\w+_\w+_\w+_\d+_([\w\d-]+)_\d+

The end result should be columns grouped by Mobile and by 1-999.

What is the way to do this? I tried the following, but it fails to group them:

import re
grouped = df.groupby(lambda x: re.search(r"\w+_\w+_\w+_\d+_([\w\d-]+)_\d+", x).group(), axis=1)
for name, group in grouped:
    print(name)
    print(group)

Which prints:

XX_111_S5_R12_001_Mobile_05
YY_222_S00_R12_001_1-999_13
ZZ_111_S00_R12_001_1-999_13

What I want is for name to print:

Mobile
1-999
1-999

And for group to print the corresponding data frame.

neversaint asked Dec 23 '22 19:12


2 Answers

You can use .str.extract on the columns in order to extract substrings for your groupby:

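# Extract the capture group from each column name and group on it.
# Note: groupby(..., axis=1) is deprecated in recent pandas releases (2.1+).
pat = r'\w+_\w+_\w+_\d+_([\w\d-]+)_\d+'
grouped = df.groupby(df.columns.str.extract(pat, expand=False), axis=1)

# Show each group's name and its columns.
for name, group in grouped:
    print(name)
    print(group, '\n')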

Which returns the expected groups:

1-999
    YY_222_S00_R12_001_1-999_13  ZZ_111_S00_R12_001_1-999_13
id                                                          
a                          -103                          1.0
b                             0                          2.3
c                          -110                          3.0
d                          -114                          5.0
e                          -114                          6.0 

Mobile
    XX_111_S5_R12_001_Mobile_05
id                             
a                           -14
b                           -90
c                           -90
d                           -96
e                           -91 
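
As for why the attempt in the question printed the full column names: calling .group() with no argument on a match object returns the entire matched text, while .group(1) returns only the first capture group. The following is a minimal sketch using just the standard library and the column names from the question; the variable names whole_matches and captured are illustrative only.

```python
import re

pat = r'\w+_\w+_\w+_\d+_([\w\d-]+)_\d+'
columns = [
    'XX_111_S5_R12_001_Mobile_05',
    'YY_222_S00_R12_001_1-999_13',
    'ZZ_111_S00_R12_001_1-999_13',
]

# .group() is the same as .group(0): the whole match, i.e. the full column name here.
whole_matches = [re.search(pat, c).group() for c in columns]

# .group(1) is the text captured by the first pair of parentheses.
captured = [re.search(pat, c).group(1) for c in columns]

print(whole_matches)
print(captured)
```

So swapping .group() for .group(1) in the original lambda should be enough to produce the intended keys, assuming every column name matches the pattern (re.search returns None otherwise).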
root answered Jan 29 '23 20:01


After grouping, set the index of the new dataframe to [re.findall(r'\w+_\w+_\w+_\d+_([\w\d-]+)_\d+', col)[0] for col in df.columns] (which is ['Mobile', '1-999', '1-999']).
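
One possible way to put that list of labels to work, sketched here under the assumption that the goal is simply to see each label with its matching columns: build the labels with re.findall, then collect the column names per label in a plain dict. The names labels and columns_by_label are hypothetical, and this sidesteps groupby on the column axis altogether rather than reproducing it.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'id': ['a', 'b', 'c', 'd', 'e'],
    'XX_111_S5_R12_001_Mobile_05': [-14, -90, -90, -96, -91],
    'YY_222_S00_R12_001_1-999_13': [-103, 0, -110, -114, -114],
    'ZZ_111_S00_R12_001_1-999_13': [1, 2.3, 3, 5, 6],
}).set_index('id')

pat = r'\w+_\w+_\w+_\d+_([\w\d-]+)_\d+'

# One label per column, in column order.
labels = [re.findall(pat, col)[0] for col in df.columns]

# Map each label to the list of columns that carry it.
columns_by_label = {}
for label, col in zip(labels, df.columns):
    columns_by_label.setdefault(label, []).append(col)

# Print each label followed by the sub-frame of its columns.
for name, cols in columns_by_label.items():
    print(name)
    print(df[cols], '\n')
```

Selecting columns with df[cols] keeps each column's original dtype, which a transpose-based approach would not necessarily do for mixed integer and float columns.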

DYZ answered Jan 29 '23 22:01