Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to invert a regular expression in pandas filter function

I have the following pandas dataframe df (which is actually just the last lines of a much larger one):

                           count
gene                            
WBGene00236788                56
WBGene00236807                 3
WBGene00249816                12
WBGene00249825                20
WBGene00255543                 6
__no_feature            11697881
__ambiguous                 1353
__too_low_aQual                0
__not_aligned                  0
__alignment_not_unique         0

I can use filter's regex option to get only the lines starting with two underscores:

df.filter(regex="^__", axis=0)

This returns the following:

                           count
gene                            
__no_feature            11697881
__ambiguous                 1353
__too_low_aQual                0
__not_aligned                  0
__alignment_not_unique         0

Actually, what I want is to have the complement: Only those lines that do not start with two underscores.

I can do it with another regular expression: df.filter(regex="^[^_][^_]", axis=0).

Is there a way to more simply specify that I want the inverse of the initial regular expression?

Is such regexp-based filtering efficient?

Edit: Testing some proposed solutions

df.filter(regex="(?!^__)", axis=0) and df.filter(regex="^\w+", axis=0) both return all lines.

According to the re module documentation, the \w special character actually includes the underscore, which explains the behaviour of the second expression.

I guess that the first one doesn't work because the (?!...) applies on what follows a pattern. Here, "^" should be put outside, as in the following proposed solution:

df.filter(regex="^(?!__).*?$", axis=0) works.

So does df.filter(regex="^(?!__)", axis=0).

like image 742
bli Avatar asked Dec 09 '16 17:12

bli


People also ask

Which syntax is correct for filter in pandas?

Pandas DataFrame: filter() function The filter is applied to the labels of the index. Keep labels from axis which are in items. Keep labels from axis for which “like in label == True”. Keep labels from axis for which re.search(regex, label) == True.

Which of the following will filter rows in a Pandas DataFrame?

You can filter the Rows from pandas DataFrame based on a single condition or multiple conditions either using DataFrame. loc[] attribute, DataFrame. query(), or DataFrame. apply() method.


3 Answers

I had the same problem but I wanted to filter the columns. Thus I am using axis=1 but concept should be similar.

df.drop(df.filter(regex='my_expression').columns,axis=1)
like image 72
harsshal Avatar answered Oct 16 '22 13:10

harsshal


Matching all lines with no two leading underscores:

^(?!__)

^ matches the beginning of the line (?!__)makes sure the line (what follows the preceding ^ match) does not begin with two underscores

Edit: dropped the .*?$ since it's not necessary to filter the lines.

like image 7
Robin Koch Avatar answered Oct 16 '22 12:10

Robin Koch


You have two possibilities here:

(?!^__) # a negative lookahead
        # making sure that there are no underscores right at the beginning of the line

Or:

^\w+  # match word characters, aka a-z, A-Z, 0-9 at least once
like image 1
Jan Avatar answered Oct 16 '22 12:10

Jan