I have the following pandas dataframe df
(which is actually just the last lines of a much larger one):
count
gene
WBGene00236788 56
WBGene00236807 3
WBGene00249816 12
WBGene00249825 20
WBGene00255543 6
__no_feature 11697881
__ambiguous 1353
__too_low_aQual 0
__not_aligned 0
__alignment_not_unique 0
I can use filter
's regex
option to get only the lines starting with two underscores:
df.filter(regex="^__", axis=0)
This returns the following:
count
gene
__no_feature 11697881
__ambiguous 1353
__too_low_aQual 0
__not_aligned 0
__alignment_not_unique 0
Actually, what I want is to have the complement: Only those lines that do not start with two underscores.
I can do it with another regular expression: df.filter(regex="^[^_][^_]", axis=0)
.
Is there a way to more simply specify that I want the inverse of the initial regular expression?
Is such regexp-based filtering efficient?
df.filter(regex="(?!^__)", axis=0)
and df.filter(regex="^\w+", axis=0)
both return all lines.
According to the re
module documentation, the \w
special character actually includes the underscore, which explains the behaviour of the second expression.
I guess that the first one doesn't work because the (?!...)
applies on what follows a pattern. Here, "^" should be put outside, as in the following proposed solution:
df.filter(regex="^(?!__).*?$", axis=0)
works.
So does df.filter(regex="^(?!__)", axis=0)
.
Pandas DataFrame: filter() function The filter is applied to the labels of the index. Keep labels from axis which are in items. Keep labels from axis for which “like in label == True”. Keep labels from axis for which re.search(regex, label) == True.
You can filter the Rows from pandas DataFrame based on a single condition or multiple conditions either using DataFrame. loc[] attribute, DataFrame. query(), or DataFrame. apply() method.
I had the same problem but I wanted to filter the columns. Thus I am using axis=1 but concept should be similar.
df.drop(df.filter(regex='my_expression').columns,axis=1)
Matching all lines with no two leading underscores:
^(?!__)
^
matches the beginning of the line
(?!__)
makes sure the line (what follows the preceding ^
match) does not begin with two underscores
Edit:
dropped the .*?$
since it's not necessary to filter the lines.
You have two possibilities here:
(?!^__) # a negative lookahead
# making sure that there are no underscores right at the beginning of the line
Or:
^\w+ # match word characters, aka a-z, A-Z, 0-9 at least once
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With