Python/Pandas: Drop rows from data frame on string match from list

Tags:

python

pandas

I have a .csv file of contact information that I import as a pandas data frame.

>>> import pandas as pd
>>> 
>>> df = pd.read_csv('data.csv')
>>> df.head()

  fName   lName                    email   title
0  John   Smith         [email protected]     CEO
1   Joe   Schmo      [email protected]  Bagger
2  Some  Person  [email protected]   Clerk

After importing the data, I'd like to drop rows where one field contains one of several substrings in a list. For example:

to_drop = ['Clerk', 'Bagger']

for i in range(len(df)):
    for term in to_drop:
        if term in df.title[i]:
            pass  # some code to drop the row from the data frame

df.to_csv("results.csv")

What is the preferred way to do this in Pandas? Should this even be a post-processing step, or is it preferred to filter this prior to writing to the data frame in the first place? My thought was that this would be easier to manipulate once in a data frame object.

Asked Jul 27 '15 by Sidney VanNess


2 Answers

Use isin and pass it your list of terms to search for; you can then negate the resulting boolean mask with ~, which filters out those rows:

In [6]:

to_drop = ['Clerk', 'Bagger']
df[~df['title'].isin(to_drop)]
Out[6]:
  fName  lName             email title
0  John  Smith  [email protected]   CEO

Another method is to join the terms with | so they become a regex, and use the vectorised str.contains:

In [8]:

df[~df['title'].str.contains('|'.join(to_drop))]
Out[8]:
  fName  lName             email title
0  John  Smith  [email protected]   CEO
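One caveat worth noting: isin matches whole cell values, while str.contains matches substrings, and the joined terms are treated as a regex. A small sketch of the str.contains approach that escapes each term (in case any contain regex metacharacters) and passes na=False so missing titles are kept rather than raising, using sample data matching the question:

```python
import re
import pandas as pd

# Sample frame mirroring the question's data.
df = pd.DataFrame({
    'fName': ['John', 'Joe', 'Some'],
    'lName': ['Smith', 'Schmo', 'Person'],
    'title': ['CEO', 'Bagger', 'Clerk'],
})

to_drop = ['Clerk', 'Bagger']

# Escape each term so regex metacharacters are matched literally.
pattern = '|'.join(re.escape(t) for t in to_drop)

# na=False treats missing titles as "no match", so those rows survive.
filtered = df[~df['title'].str.contains(pattern, na=False)]
print(filtered)
```

If you only ever need exact matches, isin is the simpler and safer choice; reach for str.contains when partial matches like 'Senior Clerk' should also be dropped.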

IMO it will be easier, and probably faster, to perform the filtering as a post-processing step, because if you decide to filter while reading then you are iteratively growing the dataframe, which is not efficient.

Alternatively, you can read the csv in chunks, filter out the rows you don't want, and append the kept chunks to your output csv.
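A minimal sketch of that chunked approach, assuming the same filenames as the question ('data.csv' and 'results.csv' are placeholder paths; the sample frame is written out first just to make the sketch self-contained):

```python
import pandas as pd

to_drop = ['Clerk', 'Bagger']

# Write a small sample csv so the sketch runs end-to-end.
pd.DataFrame({
    'fName': ['John', 'Joe', 'Some'],
    'lName': ['Smith', 'Schmo', 'Person'],
    'title': ['CEO', 'Bagger', 'Clerk'],
}).to_csv('data.csv', index=False)

# Read in chunks, filter each chunk, and append to the output file.
# Only the first chunk writes the header; later chunks append.
first = True
for chunk in pd.read_csv('data.csv', chunksize=1000):
    kept = chunk[~chunk['title'].isin(to_drop)]
    kept.to_csv('results.csv', mode='w' if first else 'a',
                header=first, index=False)
    first = False
```

This keeps peak memory bounded by the chunk size rather than the full file, which matters once the csv no longer fits comfortably in memory.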

Answered Oct 15 '22 by EdChum


Another way, using query:

In [961]: to_drop = ['Clerk', 'Bagger']

In [962]: df.query('title not in @to_drop')
Out[962]:
  fName  lName             email title
0  John  Smith  [email protected]   CEO
Answered Oct 15 '22 by Zero