How to invert a regular expression in pandas filter function

I have the following pandas dataframe df (which is actually just the last lines of a much larger one):

                           count
gene                            
WBGene00236788                56
WBGene00236807                 3
WBGene00249816                12
WBGene00249825                20
WBGene00255543                 6
__no_feature            11697881
__ambiguous                 1353
__too_low_aQual                0
__not_aligned                  0
__alignment_not_unique         0

I can use filter's regex option to get only the lines starting with two underscores:

df.filter(regex="^__", axis=0)

This returns the following:

                           count
gene                            
__no_feature            11697881
__ambiguous                 1353
__too_low_aQual                0
__not_aligned                  0
__alignment_not_unique         0

What I actually want is the complement: only those lines that do not start with two underscores.

I can do it with another regular expression: df.filter(regex="^[^_][^_]", axis=0).

Is there a way to more simply specify that I want the inverse of the initial regular expression?

Is such regexp-based filtering efficient?

Edit: Testing some proposed solutions

df.filter(regex="(?!^__)", axis=0) and df.filter(regex="^\w+", axis=0) both return all lines.

According to the re module documentation, the \w special character actually includes the underscore, which explains the behaviour of the second expression.

I guess that the first one doesn't work because the (?!...) applies to what follows the current position. Here, "^" should be put outside the lookahead, as in the following proposed solution:

df.filter(regex="^(?!__).*?$", axis=0) works.

So does df.filter(regex="^(?!__)", axis=0).
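The observations above can be reproduced with a short script. This is only a sketch: the frame is rebuilt by hand from the excerpt shown in the question, and the names `patterns` and `kept` are made up for the illustration.

```python
import pandas as pd

# Rebuild the sample frame from the question (index named "gene").
df = pd.DataFrame(
    {"count": [56, 3, 12, 20, 6, 11697881, 1353, 0, 0, 0]},
    index=pd.Index(
        [
            "WBGene00236788", "WBGene00236807", "WBGene00249816",
            "WBGene00249825", "WBGene00255543", "__no_feature",
            "__ambiguous", "__too_low_aQual", "__not_aligned",
            "__alignment_not_unique",
        ],
        name="gene",
    ),
)

# Every pattern mentioned in the question, written as raw strings.
patterns = [r"^__", r"^[^_][^_]", r"(?!^__)", r"^\w+", r"^(?!__).*?$", r"^(?!__)"]

# filter keeps a label when re.search(regex, label) finds a match.
kept = {p: len(df.filter(regex=p, axis=0)) for p in patterns}
for p, n in kept.items():
    print(f"{p!r}: {n} rows kept")
```

On this sample, `(?!^__)` and `^\w+` keep all 10 rows, while the other patterns keep 5. Note that `^[^_][^_]` is not a strict complement of `^__` in general: it requires two leading non-underscore characters, so a label such as `_x` would be excluded by it but kept by `^(?!__)`.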

I had the same problem, but I wanted to filter the columns, so I am using axis=1. The concept should be the same for rows: select what matches the expression, then drop it.

df.drop(df.filter(regex='my_expression').columns, axis=1)
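Applied to the rows of the question, the same idea might look like the sketch below. The small frame is a shortened stand-in for the one in the question, and `unwanted` and `complement` are illustrative names.

```python
import pandas as pd

# Shortened stand-in for the frame in the question.
df = pd.DataFrame(
    {"count": [56, 3, 11697881, 1353]},
    index=pd.Index(
        ["WBGene00236788", "WBGene00236807", "__no_feature", "__ambiguous"],
        name="gene",
    ),
)

# Select the row labels that match the pattern, then drop exactly those labels.
unwanted = df.filter(regex="^__", axis=0).index
complement = df.drop(unwanted, axis=0)
print(complement)
```

This keeps the original, non-inverted regular expression and moves the negation into `drop`, at the cost of building an intermediate frame.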

Matching all lines with no two leading underscores:

^(?!__)

^ matches the beginning of the line

(?!__) makes sure the line (what follows the preceding ^ match) does not begin with two underscores

Edit: dropped the .*?$ since it's not necessary to filter the lines.
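The lookahead puts the negation inside the regular expression. As a sketch of a different route, which nobody in the thread proposed, the negation can instead be applied to a boolean mask built from the index. The frame and the names `mask` and `complement` are illustrative, and no claim is made here about relative speed.

```python
import pandas as pd

# Shortened stand-in for the frame in the question.
df = pd.DataFrame(
    {"count": [56, 3, 11697881, 1353]},
    index=pd.Index(
        ["WBGene00236788", "WBGene00236807", "__no_feature", "__ambiguous"],
        name="gene",
    ),
)

# True for labels starting with two underscores; "~" inverts the mask.
mask = df.index.str.startswith("__")
complement = df.loc[~mask]
print(complement)

# For a pattern that is not a plain prefix, a regex-based mask works the same way.
complement_regex = df.loc[~df.index.str.contains(r"^__")]
```

On this sample, both variants select the same rows as `df.filter(regex="^(?!__)", axis=0)`.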

You have two possibilities here:

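(?!^__) # a negative lookahead
        # meant to make sure that the line does not start with two underscores;
        # as noted in the question's edit, the ^ has to stand outside the
        # lookahead, as in ^(?!__), for this to work with filter

Or:

^\w+  # match word characters (a-z, A-Z, 0-9 and the underscore) at least once;
      # since \w includes "_", this also matches labels starting with "__"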
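Why these two patterns do not exclude the `__` labels can be checked directly with the `re` module, since `filter` relies on `re.search`. The snippet below is a minimal sketch using one label from the question; `m1`, `m2` and `m3` are illustrative names.

```python
import re

label = "__no_feature"

# (?!^__): the lookahead fails at position 0, but re.search then retries at
# position 1, where "^" can no longer match, so the negative lookahead
# succeeds with an empty match.
m1 = re.search(r"(?!^__)", label)
print(m1.span())   # (1, 1)

# ^\w+: "\w" includes the underscore, so the whole label matches.
m2 = re.search(r"^\w+", label)
print(m2.group())  # __no_feature

# ^(?!__): "^" pins the check to position 0, so there is no match at all.
m3 = re.search(r"^(?!__)", label)
print(m3)          # None
```

Only the third pattern returns no match for a label starting with two underscores, which is why it is the one that inverts the filter.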

Tags: python, regex, pandas

Asked by: bli

3 Answers: harsshal, Robin Koch, Jan
