Python: How to delete rows ending in certain characters?

I have a large data file and I need to delete rows that end in certain letters.

Here is an example of the file I'm using:

User Name     DN
MB212DA       CN=MB212DA,CN=Users,DC=prod,DC=trovp,DC=net
MB423DA       CN=MB423DA,OU=Generic Mailbox,DC=prod,DC=trovp,DC=net
MB424PL       CN=MB424PL,CN=Users,DC=prod,DC=trovp,DC=net
MBDA423       CN=MBDA423,OU=DNA,DC=prod,DC=trovp,DC=net
MB2ADA4       CN=MB2ADA4,OU=DNA,DC=prod,DC=trovp,DC=net

Code I am using:

import pandas as pd

f = pd.read_csv('test1.csv', sep=',', encoding='latin1')
# Keep rows where User Name is missing, or does not contain DA or PL anywhere
df = f.loc[~(~pd.isnull(f['User Name']) & f['User Name'].str.contains("DA|PL"))]

How do I use regular expression syntax to delete the rows whose user name ends in "DA" or "PL", without deleting the other rows that merely contain "DA" or "PL" somewhere in the middle?

After deleting those rows, I should end up with a file like this:

User Name     DN
MBDA423       CN=MBDA423,OU=DNA,DC=prod,DC=trovp,DC=net
MB2ADA4       CN=MB2ADA4,OU=DNA,DC=prod,DC=trovp,DC=net

The first three data rows are deleted because they end in DA or PL.

asked Jul 28 '16 by JetCorey


2 Answers

You could use this expression:

df = df[~df['User Name'].str.contains('(?:DA|PL)$')]

It will return all rows that don't end in either DA or PL.

The ?: makes the group non-capturing. Without it, pandas would emit the following (harmless) warning:

UserWarning: This pattern has match groups. To actually get the groups, use str.extract.
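
For comparison, the same filter written with a plain capturing group, e.g.

df[~df['User Name'].str.contains('(DA|PL)$')]  # capturing group -> triggers the UserWarning

would still filter correctly, it would just print that warning.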

Alternatively, using endswith() and without regular expressions, the same filtering could be achieved by using the following expression:

df = df[~df['User Name'].str.endswith(('DA', 'PL'))]
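
Putting it together with the read from your question, a minimal end-to-end sketch might look like this (it assumes the file name test1.csv from your code; na=False treats rows with a missing User Name as non-matching, so they are kept):

import pandas as pd

df = pd.read_csv('test1.csv', sep=',', encoding='latin1')
# Drop rows whose user name ends in DA or PL; keep rows with a missing name
df = df[~df['User Name'].str.endswith(('DA', 'PL'), na=False)]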

As expected, the version without regular expressions is faster. A simple test using big_df, built from 10,001 copies of your original df:

# Create a larger DF to get better timing results
big_df = df.copy()

for i in range(10000):
    big_df = big_df.append(df)

print(big_df.shape)

>> (50005, 2)

# Without regular expressions
%%timeit
big_df[~big_df['User Name'].str.endswith(('DA', 'PL'))]

>> 10 loops, best of 3: 22.3 ms per loop

# With regular expressions
%%timeit
big_df[~big_df['User Name'].str.contains('(?:DA|PL)$')]

>> 10 loops, best of 3: 61.8 ms per loop
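
Side note: DataFrame.append was deprecated and then removed in pandas 2.0, so on recent versions the test frame above can be built with pd.concat instead, for example:

big_df = pd.concat([df] * 10001, ignore_index=True)  # same 50005-row frame, no append loop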
answered Oct 05 '22 by DocZerø

You can use a boolean mask that checks whether the last two characters of User Name are not (~) in a set of two-character endings:

>>> df[~df['User Name'].str[-2:].isin(['DA', 'PL'])]
  User Name                                         DN
3   MBDA423  CN=MBDA423,OU=DNA,DC=prod,DC=trovp,DC=net
4   MB2ADA4  CN=MB2ADA4,OU=DNA,DC=prod,DC=trovp,DC=net
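
If the end goal is a cleaned-up file rather than just a filtered DataFrame, the result of either approach can be written back out with to_csv (the output file name here is only an example):

df[~df['User Name'].str[-2:].isin(['DA', 'PL'])].to_csv('filtered.csv', index=False)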
answered Oct 06 '22 by Alexander