Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Filter out rows based on list of strings in Pandas

I have a large time series data frame (called df), and the first 5 records look like this:

df

         stn     years_of_data  total_minutes avg_daily TOA_daily   K_daily
date                        
1900-01-14  AlberniElementary      4    5745    34.100  114.600 0.298
1900-01-14  AlberniWeather         6    7129    29.500  114.600 0.257
1900-01-14  Arbutus                8    11174   30.500  114.600 0.266
1900-01-14  Arrowview              7    10080   27.600  114.600 0.241
1900-01-14  Bayside                7    9745    33.800  114.600 0.295

Goal:

I am trying to remove rows where any of the strings in a list are present in the 'stn' column. So,I am basically trying to filter this dataset to not include rows containing any of the strings in following list.

Attempt:

remove_list = ['Arbutus','Bayside']

cleaned = df[df['stn'].str.contains('remove_list')]

Returns:

Out[78]:

stn years_of_data   total_minutes   avg_daily   TOA_daily   K_daily
date    

Nothing!

I have tried a few combinations of quotes, brackets, and even a lambda function; though I am fairly new, so probably not using syntax properly..

like image 888
geokrowding Avatar asked Mar 07 '15 10:03

geokrowding


2 Answers

Use isin:

cleaned = df[~df['stn'].isin(remove_list)]

In [7]:

remove_list = ['Arbutus','Bayside']
df[~df['stn'].isin(remove_list)]
Out[7]:
                          stn  years_of_data  total_minutes  avg_daily  \
date                                                                     
1900-01-14  AlberniElementary              4           5745       34.1   
1900-01-14     AlberniWeather              6           7129       29.5   
1900-01-14          Arrowview              7          10080       27.6   

            TOA_daily  K_daily  
date                            
1900-01-14      114.6    0.298  
1900-01-14      114.6    0.257  
1900-01-14      114.6    0.241  
like image 78
EdChum Avatar answered Oct 16 '22 21:10

EdChum


Had a similar question, found this old thread, I think there are other ways to get the same result. My issue with @EdChum's solution for my particular application is that I don't have a list that will be matched exactly. If you have the same issue, .isin isn't meant for that application.

Instead, you can also try a few options, including a numpy.where:

  removelist = ['ayside','rrowview']
  df['flagCol'] = numpy.where(df.stn.str.contains('|'.join(remove_list)),1,0)

Note that this solution doesn't actually remove the matching rows, just flags them. You can copy/slice/drop as you like.

This solution would be useful in the case that you don't know, for example, if the station names are capitalized or not and don't want to go through standardizing text beforehand. numpy.where is usually pretty fast as well, probably not much different from .isin.

like image 18
rajan Avatar answered Oct 16 '22 21:10

rajan