I have a .csv file of contact information that I import as a pandas data frame.
>>> import pandas as pd
>>>
>>> df = pd.read_csv('data.csv')
>>> df.head()
   fName   lName                  email   title
0   John   Smith     [email protected]     CEO
1    Joe   Schmo       [email protected]  Bagger
2   Some  Person  [email protected]   Clerk
After importing the data, I'd like to drop rows where one field contains one of several substrings in a list. For example:
to_drop = ['Clerk', 'Bagger']
for i in range(len(df)):
    for k in range(len(to_drop)):
        if to_drop[k] in df.title[i]:
            pass  # some code to drop the rows from the data frame

df.to_csv("results.csv")
What is the preferred way to do this in Pandas? Should this even be a post-processing step, or is it better to filter the data before it ever goes into the data frame? My thought was that the data would be easier to manipulate once in a data frame object.
You can delete a list of rows from a Pandas DataFrame by passing their index labels to the drop() method, e.g. df.drop([5, 6], axis=0), where [5, 6] are the labels of the rows to delete and axis=0 (the default) denotes that rows, not columns, should be dropped. One of the fastest ways to delete rows that contain a specific value or fulfil a given condition, though, is to filter on that condition: once you have the filtered data, the unwanted rows are gone while the remaining rows stay intact. You can likewise drop single or multiple columns by passing their names with axis=1, and a comparison operator on a column gives you a boolean mask to filter with.
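A minimal sketch of those patterns, using the question's data (the index labels and column names below are illustrative):

import pandas as pd

df = pd.DataFrame({'fName': ['John', 'Joe', 'Some'],
                   'lName': ['Smith', 'Schmo', 'Person'],
                   'title': ['CEO', 'Bagger', 'Clerk']})

# Drop rows by index label; axis=0 (rows) is the default for drop().
df.drop([1, 2], axis=0)

# Keep only rows that satisfy a condition; the rest are dropped.
df[df['title'] != 'Clerk']

# Drop a column by name with axis=1.
df.drop('title', axis=1)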
Use isin and pass your list of terms; you can then negate the resulting boolean mask using ~, and this will filter out those rows. Note that isin matches whole values rather than substrings, which works here because each title is exactly 'Clerk' or 'Bagger':
In [6]:
to_drop = ['Clerk', 'Bagger']
df[~df['title'].isin(to_drop)]
Out[6]:
   fName  lName               email  title
0   John  Smith  [email protected]    CEO
Another method is to join the terms with '|' so they become a regex alternation, and use the vectorised str.contains, which does match substrings:
In [8]:
df[~df['title'].str.contains('|'.join(to_drop))]
Out[8]:
   fName  lName               email  title
0   John  Smith  [email protected]    CEO
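One caveat with the regex approach: if any search term contains regex metacharacters, the joined pattern will misbehave. A hedged sketch that escapes the terms first (re.escape and na=False are additions beyond the answer above; na=False treats missing titles as non-matches):

import re

pattern = '|'.join(re.escape(term) for term in to_drop)
# Rows whose title matches any escaped term are masked out.
df[~df['title'].str.contains(pattern, na=False)]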
IMO it will be easier and probably faster to perform the filtering as a post-processing step, because if you decide to filter whilst reading then you are iteratively growing the DataFrame, which is not efficient.
Alternatively, you can read the csv in chunks, filter out the rows you don't want, and append the kept chunks to your output csv.
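A rough sketch of that chunked approach, assuming the same column name and search terms (the chunk size and file names are placeholders):

to_drop = ['Clerk', 'Bagger']
reader = pd.read_csv('data.csv', chunksize=10000)
for i, chunk in enumerate(reader):
    kept = chunk[~chunk['title'].str.contains('|'.join(to_drop))]
    # Write the header with the first chunk, then append without it.
    kept.to_csv('results.csv', mode='w' if i == 0 else 'a',
                header=(i == 0), index=False)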
Another way, using query (note that, like isin, not in here compares whole values rather than substrings):
In [961]: to_drop = ['Clerk', 'Bagger']
In [962]: df.query('title not in @to_drop')
Out[962]:
   fName  lName               email  title
0   John  Smith  [email protected]    CEO