
Match mixture of characters and digits in Pandas

I need to extract substrings from a pandas df, and place them into a new column. The strings I have look like:

hj_yu_fb824_as22
jk_yu_fb638

I need to extract:

 fb824
 fb638

Moreover, the substring can be in either of two columns of the dataframe (although it appears only once per row), because the df looks like:

col1                col2
mf_lp_gn817_ml46    d_nb_05340.gif 
desktop_300x250_mf  mf_lp_fb824_ml46.html 
desktop_300x250_mf  dd_lp_ig805.html 
desktop_728x90_mf   mf_lp_fb824_ml46.html 

I would like to obtain something like:

col1                col2                     col3
mf_lp_gn817_ml46    d_nb_05340.gif           gn817
desktop_300x250_mf  mf_lp_fb824_ml46.html    fb824
desktop_300x250_mf  dd_lp_ig805.html         ig805
desktop_728x90_mf   mf_lp_fb824_ml46.html    fb824

So the substring looks like:

1) two lowercase letters at the beginning, followed by 3 digits 2) between two '_', or preceded by just one '_', or between '_' and '.' followed by something else

I came up with:

 \_([^()]*)\_

But it just matches anything between the "_"s regardless of the pattern described above.

Moreover, how can I efficiently apply a regex to a pandas dataframe?

Here's the reproducible dataframe:

from pandas import DataFrame

df = DataFrame({'col1': {0: 'mf_lp_gn817_ml46',
 1: 'desktop_300x250_mf',
 2: 'desktop_300x250_mf',
 3: 'desktop_728x90_mf'},
 'col2': {0: 'd_nb_05340.gif ',
 1: 'mf_lp_fb824_ml46.html ',
 2: 'dd_lp_ig805.html ',
 3: 'mf_lp_fb824_ml46.html '},
 'col3': {0: 'gn817', 1: 'fb824', 2: 'ig805', 3: 'fb824'}})

chopin_is_the_best asked Jan 29 '26 10:01


1 Answer

More input strings may be needed to be sure, but for the strings shown above you could use the following regex:

_([a-z]{2}[0-9]{3})[_.]
# this is an underscore
# followed by exactly 2 lowercase letters and 3 digits
# followed by an underscore or a dot
# the letters and digits (not the delimiters) are captured in group 1

For your above strings this would be:

mf_lp_gn817_ml46    d_nb_05340.gif           -> gn817
desktop_300x250_mf  mf_lp_fb824_ml46.html    -> fb824
desktop_300x250_mf  dd_lp_ig805.html         -> ig805
desktop_728x90_mf   mf_lp_fb824_ml46.html    -> fb824

See a demo on regex101.com.
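
Note that the pattern above requires a trailing '_' or '.', so it would not match a code at the very end of a string, such as jk_yu_fb638 from the question. A minimal sketch of one possible variant, assuming a lookahead that also accepts the end of the string is acceptable for your data (the names `variant` and `first_code` are only illustrative):

```python
import re

# Pattern from the answer: the code must be followed by '_' or '.'.
original = re.compile(r'_([a-z]{2}[0-9]{3})[_.]')
# Assumed variant: the lookahead also accepts the end of the string.
variant = re.compile(r'_([a-z]{2}[0-9]{3})(?=[_.]|$)')


def first_code(pattern, text):
    """Return the first captured code in text, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(1) if m else None


samples = ['hj_yu_fb824_as22', 'jk_yu_fb638', 'd_nb_05340.gif']
# Maps each sample to (result with original, result with variant).
results = {s: (first_code(original, s), first_code(variant, s)) for s in samples}
print(results)
```

With these samples, only the variant finds fb638, and neither pattern matches d_nb_05340.gif.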

Python Code:

To apply this to your DataFrame, see the following code:

import pandas as pd
from pandas import DataFrame
import re

df = DataFrame({'col1': {0: 'mf_lp_gn817_ml46',
 1: 'desktop_300x250_mf',
 2: 'desktop_300x250_mf',
 3: 'desktop_728x90_mf'},
 'col2': {0: 'd_nb_05340.gif ',
 1: 'mf_lp_fb824_ml46.html ',
 2: 'dd_lp_ig805.html ',
 3: 'mf_lp_fb824_ml46.html '}})

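# An underscore, two lowercase letters and three digits (captured), then '_' or '.'
regex = r'_([a-z]{2}[0-9]{3})[_.]'
for index, row in df.iterrows():
    for column in row.keys():
        m = re.search(regex, row[column])
        if m is not None:
            # .loc replaces .ix, which was removed in pandas 1.0
            df.loc[index, 'col3'] = m.group(1)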
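
On the efficiency part of the question: looping with iterrows works but is slow on large frames. A possible vectorized alternative, sketched here under the assumption that each row contains the code in at most one of the two columns, uses pandas' `Series.str.extract` on each column and fills the gaps of one result with the other:

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['mf_lp_gn817_ml46', 'desktop_300x250_mf',
             'desktop_300x250_mf', 'desktop_728x90_mf'],
    'col2': ['d_nb_05340.gif ', 'mf_lp_fb824_ml46.html ',
             'dd_lp_ig805.html ', 'mf_lp_fb824_ml46.html '],
})

pattern = r'_([a-z]{2}[0-9]{3})[_.]'

# expand=False returns a Series (NaN where there is no match)
# because the pattern has a single capture group.
from_col1 = df['col1'].str.extract(pattern, expand=False)
from_col2 = df['col2'].str.extract(pattern, expand=False)

# Prefer the match from col1 and fall back to col2 where col1 had none.
df['col3'] = from_col1.fillna(from_col2)
print(df)
```

Unlike the loop, this gives col1 precedence if both columns happened to match, which makes no difference when the code appears only once per row.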

Jan answered Jan 31 '26 00:01


