I have a large data file and I need to delete rows that end in certain letters.
Here is an example of the file I'm using:
User Name DN
MB212DA CN=MB212DA,CN=Users,DC=prod,DC=trovp,DC=net
MB423DA CN=MB423DA,OU=Generic Mailbox,DC=prod,DC=trovp,DC=net
MB424PL CN=MB424PL,CN=Users,DC=prod,DC=trovp,DC=net
MBDA423 CN=MBDA423,OU=DNA,DC=prod,DC=trovp,DC=net
MB2ADA4 CN=MB2ADA4,OU=DNA,DC=prod,DC=trovp,DC=net
Code I am using:
from pandas import DataFrame, read_csv
import pandas as pd
f = pd.read_csv('test1.csv', sep=',', encoding='latin1')
df = f.loc[~(pd.notnull(f['User Name']) & f['User Name'].str.contains("DA|PL"))]
How do I use regular expression syntax to delete the rows whose user names end in "DA" or "PL", without deleting the other rows that merely contain "DA" or "PL" somewhere inside them?
It should delete the rows and I end up with a file like this:
User Name DN
MBDA423 CN=MBDA423,OU=DNA,DC=prod,DC=trovp,DC=net
MB2ADA4 CN=MB2ADA4,OU=DNA,DC=prod,DC=trovp,DC=net
The first three rows are deleted because their user names end in "DA" or "PL".
You could use this expression
df = df[~df['User Name'].str.contains('(?:DA|PL)$')]
It will return all rows that don't end in either DA or PL.
The ?: makes the group non-capturing, so the brackets do not capture anything. Without it, pandas returns the following (harmless) warning:
UserWarning: This pattern has match groups. To actually get the groups, use str.extract.
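For reference, here is a minimal, self-contained version of this filter built on the sample data from the question:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'User Name': ['MB212DA', 'MB423DA', 'MB424PL', 'MBDA423', 'MB2ADA4'],
    'DN': [
        'CN=MB212DA,CN=Users,DC=prod,DC=trovp,DC=net',
        'CN=MB423DA,OU=Generic Mailbox,DC=prod,DC=trovp,DC=net',
        'CN=MB424PL,CN=Users,DC=prod,DC=trovp,DC=net',
        'CN=MBDA423,OU=DNA,DC=prod,DC=trovp,DC=net',
        'CN=MB2ADA4,OU=DNA,DC=prod,DC=trovp,DC=net',
    ],
})

# (?:DA|PL)$ matches DA or PL only at the end of the string;
# ~ negates the mask, keeping rows that do NOT end in DA or PL
filtered = df[~df['User Name'].str.contains('(?:DA|PL)$')]
print(filtered['User Name'].tolist())  # ['MBDA423', 'MB2ADA4']
```

Note that MBDA423 and MB2ADA4 survive even though they contain "DA", because the $ anchor restricts the match to the end of the string.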
Alternatively, the same filtering can be achieved without regular expressions by using endswith():
df = df[~df['User Name'].str.endswith(('DA', 'PL'))]
As expected, the version without regular expressions is faster. A simple test on big_df, which consists of 10001 copies of your original df:
# Create a larger DF to get better timing results
big_df = pd.concat([df] * 10001, ignore_index=True)
print(big_df.shape)
>> (50005, 2)
# Without regular expressions
%%timeit
big_df[~big_df['User Name'].str.endswith(('DA', 'PL'))]
>> 10 loops, best of 3: 22.3 ms per loop
# With regular expressions
%%timeit
big_df[~big_df['User Name'].str.contains('(?:DA|PL)$')]
>> 10 loops, best of 3: 61.8 ms per loop
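Putting the pieces together, here is a sketch of the full round trip (read, filter, write back), with io.StringIO standing in for test1.csv so the example is self-contained; swap in your real file paths:

```python
import io
import pandas as pd

# Stand-in for test1.csv; replace io.StringIO(...) with your actual path.
# DN values contain commas, so they are quoted in the CSV.
csv_data = io.StringIO(
    'User Name,DN\n'
    'MB212DA,"CN=MB212DA,CN=Users,DC=prod,DC=trovp,DC=net"\n'
    'MB423DA,"CN=MB423DA,OU=Generic Mailbox,DC=prod,DC=trovp,DC=net"\n'
    'MB424PL,"CN=MB424PL,CN=Users,DC=prod,DC=trovp,DC=net"\n'
    'MBDA423,"CN=MBDA423,OU=DNA,DC=prod,DC=trovp,DC=net"\n'
    'MB2ADA4,"CN=MB2ADA4,OU=DNA,DC=prod,DC=trovp,DC=net"\n'
)

df = pd.read_csv(csv_data)

# Drop rows whose user name ends in DA or PL
df = df[~df['User Name'].str.endswith(('DA', 'PL'))]

# Write the result back out (output buffer/filename is an assumption)
out = io.StringIO()
df.to_csv(out, index=False)
print(df['User Name'].tolist())  # ['MBDA423', 'MB2ADA4']
```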
You can use a boolean mask whereby you check whether the last two characters of User Name are not (~) in a set of two-character endings:
>>> df[~df['User Name'].str[-2:].isin(['DA', 'PL'])]
  User Name                                         DN
3   MBDA423  CN=MBDA423,OU=DNA,DC=prod,DC=trovp,DC=net
4   MB2ADA4  CN=MB2ADA4,OU=DNA,DC=prod,DC=trovp,DC=net