I have a large data file and I need to delete rows that end in certain letters.
Here is an example of the file I'm using:
User Name DN
MB212DA CN=MB212DA,CN=Users,DC=prod,DC=trovp,DC=net
MB423DA CN=MB423DA,OU=Generic Mailbox,DC=prod,DC=trovp,DC=net
MB424PL CN=MB424PL,CN=Users,DC=prod,DC=trovp,DC=net
MBDA423 CN=MBDA423,OU=DNA,DC=prod,DC=trovp,DC=net
MB2ADA4 CN=MB2ADA4,OU=DNA,DC=prod,DC=trovp,DC=net
Code I am using:
from pandas import DataFrame, read_csv
import pandas as pd
f = pd.read_csv('test1.csv', sep=',', encoding='latin1')
df = f.loc[~(pd.notnull(f['User Name']) & f['User Name'].str.contains("DA|PL"))]
How do I use regular expression syntax to delete the rows whose user names end in "DA" or "PL", without deleting the other rows that merely contain "DA" or "PL" somewhere inside them?
It should delete the rows and I end up with a file like this:
User Name DN
MBDA423 CN=MBDA423,OU=DNA,DC=prod,DC=trovp,DC=net
MB2ADA4 CN=MB2ADA4,OU=DNA,DC=prod,DC=trovp,DC=net
The first three rows are deleted because their user names end in "DA" or "PL".
You could use this expression
df = df[~df['User Name'].str.contains('(?:DA|PL)$')]
It will return all rows that don't end in either DA or PL.
The ?: makes the group non-capturing, so the brackets do not capture anything. Without it, pandas returns the following (harmless) warning:
UserWarning: This pattern has match groups. To actually get the groups, use str.extract.
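For reference, here is a minimal, self-contained version of this filter built on the sample data from the question:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'User Name': ['MB212DA', 'MB423DA', 'MB424PL', 'MBDA423', 'MB2ADA4'],
    'DN': [
        'CN=MB212DA,CN=Users,DC=prod,DC=trovp,DC=net',
        'CN=MB423DA,OU=Generic Mailbox,DC=prod,DC=trovp,DC=net',
        'CN=MB424PL,CN=Users,DC=prod,DC=trovp,DC=net',
        'CN=MBDA423,OU=DNA,DC=prod,DC=trovp,DC=net',
        'CN=MB2ADA4,OU=DNA,DC=prod,DC=trovp,DC=net',
    ],
})

# (?:DA|PL)$ matches DA or PL only at the end of the string;
# ~ negates the mask, keeping rows that do NOT end in DA or PL
filtered = df[~df['User Name'].str.contains('(?:DA|PL)$')]
print(filtered['User Name'].tolist())  # ['MBDA423', 'MB2ADA4']
```

Note that MBDA423 and MB2ADA4 survive even though they contain "DA", because the $ anchor restricts the match to the end of the string.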
Alternatively, the same filtering can be achieved without regular expressions by using endswith():
df = df[~df['User Name'].str.endswith(('DA', 'PL'))]
As expected, the version without regular expressions is faster. A simple test on big_df, which consists of 10001 copies of your original df:
# Create a larger DF to get better timing results
big_df = pd.concat([df] * 10001, ignore_index=True)
print(big_df.shape)
>> (50005, 2)
# Without regular expressions
%%timeit
big_df[~big_df['User Name'].str.endswith(('DA', 'PL'))]
>> 10 loops, best of 3: 22.3 ms per loop
# With regular expressions
%%timeit
big_df[~big_df['User Name'].str.contains('(?:DA|PL)$')]
>> 10 loops, best of 3: 61.8 ms per loop
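Putting the pieces together, here is a sketch of the full round trip (read, filter, write back), with io.StringIO standing in for test1.csv so the example is self-contained; swap in your real file paths:

```python
import io
import pandas as pd

# Stand-in for test1.csv; replace io.StringIO(...) with your actual path.
# DN values contain commas, so they are quoted in the CSV.
csv_data = io.StringIO(
    'User Name,DN\n'
    'MB212DA,"CN=MB212DA,CN=Users,DC=prod,DC=trovp,DC=net"\n'
    'MB423DA,"CN=MB423DA,OU=Generic Mailbox,DC=prod,DC=trovp,DC=net"\n'
    'MB424PL,"CN=MB424PL,CN=Users,DC=prod,DC=trovp,DC=net"\n'
    'MBDA423,"CN=MBDA423,OU=DNA,DC=prod,DC=trovp,DC=net"\n'
    'MB2ADA4,"CN=MB2ADA4,OU=DNA,DC=prod,DC=trovp,DC=net"\n'
)

df = pd.read_csv(csv_data)

# Drop rows whose user name ends in DA or PL
df = df[~df['User Name'].str.endswith(('DA', 'PL'))]

# Write the result back out (output buffer/filename is an assumption)
out = io.StringIO()
df.to_csv(out, index=False)
print(df['User Name'].tolist())  # ['MBDA423', 'MB2ADA4']
```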
You can use a boolean mask whereby you check whether the last two characters of User Name are not (~) in a set of two-character endings:
>>> df[~df['User Name'].str[-2:].isin(['DA', 'PL'])]
  User Name                                         DN
3   MBDA423  CN=MBDA423,OU=DNA,DC=prod,DC=trovp,DC=net
4   MB2ADA4  CN=MB2ADA4,OU=DNA,DC=prod,DC=trovp,DC=net