Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to use regex non-capturing groups format in Python

In the following code I want to get just the digits between '-' and 'u'. I thought i could apply regular expression non capturing groups format (?: … ) to ignore everything from '-' to the first digit. But output always include it. How can i use noncapturing groups format to generate correct ouput?

df = pd.DataFrame(
    {'a' : [1,2,3,4], 
     'b' : ['41u -428u', '31u - 68u', '11u - 58u', '21u - 318u']
    })

df['b'].str.extract('((?:-[ ]*)[0-9]*)', expand=True)

enter image description here enter image description here

like image 585
StackUser Avatar asked May 18 '18 18:05

StackUser


People also ask

How do you write a non-capturing group in regex?

Sometimes, you may want to create a group but don't want to capture it in the groups of the match. To do that, you can use a non-capturing group with the following syntax: (?:X)

What is the point of non-capturing group in regex?

They can help you to extract exact information from a bigger match (which can also be named), they let you rematch a previous matched group, and can be used for substitutions.

What are capturing groups in regex Python?

Capturing groups are a handy feature of regular expression matching that allows us to query the Match object to find out the part of the string that matched against a particular part of the regular expression. Anything you have in parentheses () will be a capture group.

How do I match a group in regex?

Capturing groups are a way to treat multiple characters as a single unit. They are created by placing the characters to be grouped inside a set of parentheses. For example, the regular expression (dog) creates a single group containing the letters "d", "o", and "g".


2 Answers

It isn't included in the inner group, but it's still included as part of the outer group. A non-capturing group does't necessarily imply it isn't captured at all... just that that group does not explicitly get saved in the output. It is still captured as part of any enclosing groups.

Just do not put them into the () that define the capturing:

import pandas as pd

df = pd.DataFrame(
    {'a' : [1,2,3,4], 
     'b' : ['41u -428u', '31u - 68u', '11u - 58u', '21u - 318u']
    })

df['b'].str.extract(r'- ?(\d+)u', expand=True)

     0
0  428
1   68
2   58
3  318

That way you match anything that has a '-' in front (mabye followed by a aspace), a 'u' behind and numbers between the both.

Where,

-      # literal hyphen
\s?    # optional space—or you could go with \s* if you expect more than one
(\d+)  # capture one or more digits 
u      # literal "u"
like image 184
Patrick Artner Avatar answered Sep 25 '22 15:09

Patrick Artner


I think you're trying too complicated a regex. What about:

df['b'].str.extract(r'-(.*)u', expand=True)

      0
0   428
1    68
2    58
3   318
like image 41
sacuL Avatar answered Sep 22 '22 15:09

sacuL