
Pythonic way of applying regex to all columns of dataframe

I have a dataframe in which every column contains keyword and value pairs. See the example below.

Input DataFrame

I want to apply a regex to all of the columns, so I use a for loop and apply the regex to each column:

for i in range(1, maxExtended_Keywords):
    temp = 'extdkey_' + str(i)
    Extended_Keywords[temp] = Extended_Keywords[temp].str.extract(r":(.*)", expand=True)

And I get the desired final result. No issues there.

Desired output

However, I'm just curious: is there a Pythonic way to apply a regex to the entire dataframe, instead of using a for loop and applying it column by column?

Thanks,

prasadav asked Apr 13 '18 19:04



2 Answers

Use pandas.DataFrame.replace with regex=True

df.replace(r'^.*:\s*(.*)', r'\1', regex=True)

Notice that my pattern uses parentheses to capture the part after the ':' (the \s* skips any whitespace that follows the colon), and that the replacement is the raw string r'\1', which refers back to that capture group.


MCVE

import numpy as np
import pandas as pd

df = pd.DataFrame([
    [np.nan, 'thing1: hello'],
    ['thing2: world', np.nan]
], columns=['extdkey1', 'extdkey2'])

df

        extdkey1       extdkey2
0            NaN  thing1: hello
1  thing2: world            NaN

df.replace(r'^.*:\s*(.*)', r'\1', regex=True)

  extdkey1 extdkey2
0      NaN    hello
1    world      NaN
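
One behavior worth checking before swapping this in for the loop: as far as I understand it, replace leaves a cell untouched when the pattern does not match, whereas str.extract (used in the question) returns NaN for that cell. A minimal sketch, assuming a made-up cell 'no colon here' that is not part of the original data:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the second row of extdkey1 contains no ':' at all.
df = pd.DataFrame({
    'extdkey1': ['thing1: hello', 'no colon here'],
    'extdkey2': [np.nan, 'thing2: world'],
})

# replace() rewrites only the cells that match; all other cells are kept as-is.
replaced = df.replace(r'^.*:\s*(.*)', r'\1', regex=True)

# str.extract() yields NaN wherever the pattern does not match.
extracted = df['extdkey1'].str.extract(r':\s*(.*)', expand=False)

print(replaced)
print(extracted)
```

Which of the two behaviors is preferable depends on whether cells without a ':' should survive or be blanked out.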
piRSquared answered Sep 28 '22 10:09


You can use applymap, which applies a function to each element of the dataframe (pandas 2.1 and later deprecate it in favor of DataFrame.map, which is called the same way). For this problem you can do this:

import re
pattern = re.compile(r'^.*:\s*(.*)')
func = lambda x: pattern.findall(x)[0] if isinstance(x, str) and pattern.findall(x) else x
df.applymap(func)

Caution: avoid applymap on huge dataframes. It calls the Python function once per element, which is slow.
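
If the per-element calls turn out to be too slow, one possible middle ground is to keep the vectorized str.extract from the question and let DataFrame.apply run it once per column rather than once per cell. This is only a sketch, assuming every column holds strings or NaN. The data mirrors the MCVE from the other answer, and the pattern r':\s*(.*)' is my own variant of the question's regex:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    [np.nan, 'thing1: hello'],
    ['thing2: world', np.nan]
], columns=['extdkey1', 'extdkey2'])

# apply() hands each column (a Series) to the lambda, so the regex goes
# through the vectorized .str accessor instead of one Python call per cell.
result = df.apply(lambda col: col.str.extract(r':\s*(.*)', expand=False))

print(result)
```

Note that, unlike the applymap version, cells that do not match the pattern become NaN here. It is worth timing both approaches on your own data before settling on one.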

romulomadu answered Sep 28 '22 09:09