In the following code I want to get just the digits between '-' and 'u'. I thought I could use the regular expression non-capturing group syntax (?: … ) to ignore everything from '-' to the first digit, but the output always includes it. How can I use a non-capturing group to generate the correct output?
import pandas as pd
df = pd.DataFrame(
{'a' : [1,2,3,4],
'b' : ['41u -428u', '31u - 68u', '11u - 58u', '21u - 318u']
})
df['b'].str.extract(r'((?:-[ ]*)[0-9]*)', expand=True)
Sometimes, you may want to create a group but don't want to capture it in the groups of the match. To do that, you can use a non-capturing group with the following syntax: (?:X)
Capturing groups, by contrast, can help you extract exact information from a bigger match (and can also be named), let you rematch a previously matched group, and can be used for substitutions.
Capturing groups are a handy feature of regular expression matching that allows us to query the Match object to find out the part of the string that matched a particular part of the regular expression. Anything you put in plain parentheses () will be a capture group.
Capturing groups are a way to treat multiple characters as a single unit. They are created by placing the characters to be grouped inside a set of parentheses. For example, the regular expression (dog) creates a single group containing the letters "d", "o", and "g".
The text matched by the non-capturing group isn't saved as a group of its own, but it's still included as part of the outer group. A non-capturing group doesn't necessarily mean its text isn't captured at all, just that the group itself does not explicitly get saved in the output. Its text is still captured as part of any enclosing groups.
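To make that concrete, here is a small sketch using Python's built-in re module on the sample strings from the question. The variable names are only for illustration. It compares the question's pattern, where the non-capturing group sits inside the outer parentheses, with a variant where it sits outside them:

```python
import re

samples = ['41u -428u', '31u - 68u', '11u - 58u', '21u - 318u']

# Pattern from the question: the non-capturing group is INSIDE the
# outer capturing group, so its text is part of group 1.
nested = re.compile(r'((?:-[ ]*)[0-9]*)')

# Same pieces, but the non-capturing group is OUTSIDE the capturing
# group, so only the digits end up in group 1.
outside = re.compile(r'(?:-[ ]*)([0-9]+)')

nested_results = [nested.search(s).group(1) for s in samples]
outside_results = [outside.search(s).group(1) for s in samples]

print(nested_results)   # ['-428', '- 68', '- 58', '- 318']
print(outside_results)  # ['428', '68', '58', '318']
```

Since str.extract returns the capture groups of the pattern, the same placement rule should apply there as well.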
Just don't put them inside the () that define the capturing group:
import pandas as pd
df = pd.DataFrame(
{'a' : [1,2,3,4],
'b' : ['41u -428u', '31u - 68u', '11u - 58u', '21u - 318u']
})
df['b'].str.extract(r'- ?(\d+)u', expand=True)
0
0 428
1 68
2 58
3 318
That way you match anything that has a '-' in front (maybe followed by a space), a 'u' behind, and digits between the two.
Where,
- # literal hyphen
\s? # optional whitespace (the pattern above uses a literal space followed by ?); you could go with \s* if you expect more than one
(\d+) # capture one or more digits
u # literal "u"
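If you like keeping that kind of breakdown next to the pattern, one option is the re.VERBOSE flag, which lets the comments live inside the regex itself. This is just a sketch with the standard re module on the question's sample strings; str.extract also takes a flags argument, so the same flag can be passed there.

```python
import re

samples = ['41u -428u', '31u - 68u', '11u - 58u', '21u - 318u']

# With re.VERBOSE, whitespace in the pattern is ignored and "#" starts
# a comment, so the explanation can sit beside each piece.
pattern = re.compile(r"""
    -        # literal hyphen
    \s?      # optional whitespace character
    (\d+)    # capture one or more digits
    u        # literal "u"
""", re.VERBOSE)

digits = [pattern.search(s).group(1) for s in samples]
print(digits)  # ['428', '68', '58', '318']
```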
I think you're trying too complicated a regex. What about:
df['b'].str.extract(r'-(.*)u', expand=True)
0
0 428
1 68
2 58
3 318
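One thing worth checking with the .* version: the group also picks up any space that follows the '-', which is easy to miss in the printed DataFrame. A quick sketch with the plain re module on the same sample strings shows the difference from a digits-only group:

```python
import re

samples = ['41u -428u', '31u - 68u', '11u - 58u', '21u - 318u']

# ".*" matches any characters, including the space after the hyphen.
loose = [re.search(r'-(.*)u', s).group(1) for s in samples]

# Consuming the whitespace outside the group keeps only the digits.
strict = [re.search(r'-\s*(\d+)u', s).group(1) for s in samples]

print(loose)   # ['428', ' 68', ' 58', ' 318']
print(strict)  # ['428', '68', '58', '318']
```

If you keep the .* pattern, stripping the result afterwards (for example with .str.strip() on the extracted column) should give the same digits.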