Python/Pandas: Drop rows from data frame on string match from list

Tags:

python

pandas

I have a .csv file of contact information that I import as a pandas data frame.

>>> import pandas as pd
>>> 
>>> df = pd.read_csv('data.csv')
>>> df.head()

  fName   lName                    email   title
0  John   Smith         [email protected]     CEO
1   Joe   Schmo      [email protected]  Bagger
2  Some  Person  [email protected]   Clerk

After importing the data, I'd like to drop rows where one field contains one of several substrings in a list. For example:

to_drop = ['Clerk', 'Bagger']

for i in range(len(df)):
    for term in to_drop:
        if term in df.title[i]:
            pass  # some code to drop the row from the data frame

df.to_csv("results.csv")

What is the preferred way to do this in Pandas? Should this even be a post-processing step, or is it preferred to filter this prior to writing to the data frame in the first place? My thought was that this would be easier to manipulate once in a data frame object.

Asked Jul 27 '15 by Sidney VanNess


2 Answers

Use isin and pass it your list of terms to search for; you can then negate the resulting boolean mask with ~, which filters out those rows:

In [6]:

to_drop = ['Clerk', 'Bagger']
df[~df['title'].isin(to_drop)]
Out[6]:
  fName  lName             email title
0  John  Smith  [email protected]   CEO

Another method is to join the terms with | so they become a regex, and use the vectorised str.contains:

In [8]:

df[~df['title'].str.contains('|'.join(to_drop))]
Out[8]:
  fName  lName             email title
0  John  Smith  [email protected]   CEO
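One caveat worth noting: isin matches whole cell values, while str.contains matches substrings, and the joined terms are treated as a regex. A small sketch of the str.contains approach that escapes each term (in case any contain regex metacharacters) and passes na=False so missing titles are kept rather than raising, using sample data matching the question:

```python
import re
import pandas as pd

# Sample frame mirroring the question's data.
df = pd.DataFrame({
    'fName': ['John', 'Joe', 'Some'],
    'lName': ['Smith', 'Schmo', 'Person'],
    'title': ['CEO', 'Bagger', 'Clerk'],
})

to_drop = ['Clerk', 'Bagger']

# Escape each term so regex metacharacters are matched literally.
pattern = '|'.join(re.escape(t) for t in to_drop)

# na=False treats missing titles as "no match", so those rows survive.
filtered = df[~df['title'].str.contains(pattern, na=False)]
print(filtered)
```

If you only ever need exact matches, isin is the simpler and safer choice; reach for str.contains when partial matches like 'Senior Clerk' should also be dropped.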

IMO it will be easier, and probably faster, to perform the filtering as a post-processing step, because if you decide to filter while reading then you are iteratively growing the dataframe, which is not efficient.

Alternatively, you can read the csv in chunks, filter out the rows you don't want, and append the kept chunks to your output csv.
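A minimal sketch of that chunked approach, assuming the same filenames as the question ('data.csv' and 'results.csv' are placeholder paths; the sample frame is written out first just to make the sketch self-contained):

```python
import pandas as pd

to_drop = ['Clerk', 'Bagger']

# Write a small sample csv so the sketch runs end-to-end.
pd.DataFrame({
    'fName': ['John', 'Joe', 'Some'],
    'lName': ['Smith', 'Schmo', 'Person'],
    'title': ['CEO', 'Bagger', 'Clerk'],
}).to_csv('data.csv', index=False)

# Read in chunks, filter each chunk, and append to the output file.
# Only the first chunk writes the header; later chunks append.
first = True
for chunk in pd.read_csv('data.csv', chunksize=1000):
    kept = chunk[~chunk['title'].isin(to_drop)]
    kept.to_csv('results.csv', mode='w' if first else 'a',
                header=first, index=False)
    first = False
```

This keeps peak memory bounded by the chunk size rather than the full file, which matters once the csv no longer fits comfortably in memory.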

Answered Oct 15 '22 by EdChum


Another way, using query:

In [961]: to_drop = ['Clerk', 'Bagger']

In [962]: df.query('title not in @to_drop')
Out[962]:
  fName  lName             email title
0  John  Smith  [email protected]   CEO
Answered Oct 15 '22 by Zero