Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Create a new dataframe form an existing dataframe taking only the rows matching specified REGEXs in different columns in Pandas

I have a dataframe called "base_dataframe" that looks as following:

      F_NAME      L_NAME       EMAIL     
0     Suzy        Maripol      [email protected]
1     Anna        Smith        [email protected]
2     Flo         Mariland     [email protected]
3     Sarah       Linder       [email protected]
4     Nala        Kross        [email protected]
5     Sarosh      Fink         [email protected]

I would like to create a new dataframe that only contains the rows matching specific regular expressions that I define:

  • For column "F_NAME" I only want to copy over the rows that contain "Sar"
  • For column "L_NAME" I only want to copy over the rows that contain "Mari"

The way I tackle this in my code is :

sar_df = base_dataframe["F_NAME"].str.extract(r'(?P<sar_content>(^Sar.*))') 
mari_df = base_dataframe["L_NAME"].str.extract(r'(?P<mar_content>(^Mari.*))') 

Then I copy those filtered columns/DFs over to my target dataframe "new_dataframe":

new_dataframe["selected_F_NAME"] = sar_df.copy
new_dataframe["selected_L_NAME"] = mari_df.copy

And my "new_dataframe" would at the end look like this :

      F_NAME      L_NAME       EMAIL     
0     Suzy        Maripol      [email protected]
2     Flo         Mariland     [email protected]
3     Sarah       Linder       [email protected]
5     Sarosh      Fink         [email protected]

This works for me but it takes an extremely long time to copy over all the data to my "new_dataframe", because my "base_dataframe" has many hundred thousands of rows. I also need to apply multiple different regular-expressions on multiples columns ( the dataframe example I gave is basically simplified, just to explain what I want to do).

I am pretty sure there is a more optimised way to do this, but can't figure it out right now. I would appreciate any help with this.

like image 708
Saly07 Avatar asked Dec 09 '25 16:12

Saly07


1 Answers

Since you goal seems to be filtering, couldn't you replace your extract logic by a simple boolean indexing?:

# identify rows with F_NAME starting with "Sar"
m1 = base_dataframe['F_NAME'].str.startswith('Sar')
# identify rows with L_NAME starting with "Mari"
m2 = base_dataframe['L_NAME'].str.startswith('Mari')

# keep rows with either match
out = base_dataframe[m1|m2]

Or, if you have multiple conditions:

conditions = [base_dataframe['F_NAME'].str.startswith('Sar'),
              base_dataframe['L_NAME'].str.startswith('Mari'),
             # ... other conditions,
             ]

out = base_dataframe[np.logical_xor.reduce(conditions)]

Output:

   F_NAME    L_NAME            EMAIL
0    Suzy   Maripol    [email protected]
2     Flo  Mariland     [email protected]
3   Sarah    Linder   [email protected]
5  Sarosh      Fink  [email protected]
like image 87
mozway Avatar answered Dec 11 '25 04:12

mozway



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!