
Python: UserWarning: This pattern has match groups. To actually get the groups, use str.extract

I have a dataframe and I am trying to select the rows where one of the columns contains any of several strings. The df looks like this:

    member_id,event_path,event_time,event_duration
    30595,"2016-03-30 12:27:33",yandex.ru/,1
    30595,"2016-03-30 12:31:42",yandex.ru/,0
    30595,"2016-03-30 12:31:43",yandex.ru/search/?lr=10738&msid=22901.25826.1459330364.89548&text=%D1%84%D0%B8%D0%BB%D1%8C%D0%BC%D1%8B+%D0%BE%D0%BD%D0%BB%D0%B0%D0%B9%D0%BD&suggest_reqid=168542624144922467267026838391360&csg=3381%2C3938%2C2%2C3%2C1%2C0%2C0,0
    30595,"2016-03-30 12:31:44",yandex.ru/search/?lr=10738&msid=22901.25826.1459330364.89548&text=%D1%84%D0%B8%D0%BB%D1%8C%D0%BC%D1%8B+%D0%BE%D0%BD%D0%BB%D0%B0%D0%B9%D0%BD&suggest_reqid=168542624144922467267026838391360&csg=3381%2C3938%2C2%2C3%2C1%2C0%2C0,0
    30595,"2016-03-30 12:31:45",yandex.ru/search/?lr=10738&msid=22901.25826.1459330364.89548&text=%D1%84%D0%B8%D0%BB%D1%8C%D0%BC%D1%8B+%D0%BE%D0%BD%D0%BB%D0%B0%D0%B9%D0%BD&suggest_reqid=168542624144922467267026838391360&csg=3381%2C3938%2C2%2C3%2C1%2C0%2C0,0
    30595,"2016-03-30 12:31:46",yandex.ru/search/?lr=10738&msid=22901.25826.1459330364.89548&text=%D1%84%D0%B8%D0%BB%D1%8C%D0%BC%D1%8B+%D0%BE%D0%BD%D0%BB%D0%B0%D0%B9%D0%BD&suggest_reqid=168542624144922467267026838391360&csg=3381%2C3938%2C2%2C3%2C1%2C0%2C0,0
    30595,"2016-03-30 12:31:49",kinogo.co/,1
    30595,"2016-03-30 12:32:11",kinogo.co/melodramy/,0

And another df with urls

    url
    003\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/mobilnyj_telefon_bq_phoenix
    003\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/mobilnyj_telefon_fly_
    003\.ru\/sonyxperia
    003\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/mobilnye_telefony_smartfony
    003\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/mobilnye_telefony_smartfony\/brands5D5Bbr_23
    1click\.ru\/sonyxperia
    1click\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/chasy-motorola

I use

    urls = pd.read_csv('relevant_url1.csv', error_bad_lines=False)
    substr = urls.url.values.tolist()
    data = pd.read_csv('data_nts2.csv', error_bad_lines=False, chunksize=50000)
    result = pd.DataFrame()
    for i, df in enumerate(data):
        res = df[df['event_time'].str.contains('|'.join(substr), regex=True)]

but it returns

UserWarning: This pattern has match groups. To actually get the groups, use str.extract. 

How can I fix that?

Petr Petrov asked Oct 06 '16 16:10



2 Answers

An alternative way to get rid of the warning is to change the regex so that it uses a non-capturing group instead of a capturing group. That is the (?:...) notation.

Thus, if the capturing group is (url1|url2), it should be replaced by (?:url1|url2).
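As a minimal sketch of that substitution (the sample URLs and both patterns below are made up for illustration and are not taken from the asker's file), the non-capturing form selects the same rows, and `re` reports zero capturing groups for it, which is the condition under which pandas should not warn:

```python
import re

import pandas as pd

# Hypothetical sample data, only for illustration.
s = pd.Series(["003.ru/sonyxperia", "yandex.ru/", "1click.ru/sonyxperia"])

capturing = r"(003\.ru|1click\.ru)/sonyxperia"        # would trigger the UserWarning
non_capturing = r"(?:003\.ru|1click\.ru)/sonyxperia"  # same matches, no capturing group

# Pattern.groups is the number of capturing groups in a compiled regex.
n_capturing = re.compile(capturing).groups
n_non_capturing = re.compile(non_capturing).groups

# Boolean mask built from the non-capturing pattern.
mask = s.str.contains(non_capturing, regex=True)
print(n_capturing, n_non_capturing, mask.tolist())
```

Only the grouping syntax changes; the alternation itself still works as before.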

climatebrad answered Sep 22 '22 07:09


At least one of the regex patterns in urls must use a capturing group. str.contains only returns True or False for each row in df['event_time'] -- it does not make use of the capturing group. Thus, the UserWarning is alerting you that the regex uses a capturing group but the captured text is never used.
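For contrast, here is a small sketch of the difference the warning is hinting at. The sample values are illustrative, and expand=False is used only so that a single group comes back as a Series rather than a one-column DataFrame:

```python
import pandas as pd

s = pd.Series(["gouda", "stilton", "gruyere"])

# str.contains answers "does it match?" with a boolean per row,
# so a pattern without a capturing group is sufficient.
mask = s.str.contains(r"g.*", regex=True)

# str.extract returns the text captured by the group
# (NaN for rows where the pattern does not match).
captured = s.str.extract(r"g(.*)", expand=False)

print(mask.tolist())
print(captured.tolist())
```

If only the True/False result is needed, as in the question, the group can simply be dropped or made non-capturing.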

If you wish to remove the UserWarning you could find and remove the capturing group from the regex pattern(s). They are not shown in the regex patterns you posted, but they must be there in your actual file. Look for parentheses outside of the character classes.
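One way to locate the offending patterns, sketched under the assumption that the patterns are available as a plain list of strings: compile each one and check its `groups` attribute. The helper name `patterns_with_groups` and the third pattern are hypothetical; the first two patterns are copied from the question.

```python
import re

patterns = [
    r"003\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/mobilnyj_telefon_fly_",  # from the question
    r"003\.ru\/sonyxperia",                                         # from the question
    r"1click\.ru\/(sonyxperia|chasy-motorola)",                     # hypothetical offender
]


def patterns_with_groups(patterns):
    """Return the patterns that contain at least one capturing group."""
    return [p for p in patterns if re.compile(p).groups > 0]


offenders = patterns_with_groups(patterns)
for p in offenders:
    print("has capturing group:", p)
```

Parentheses inside a character class such as `[...|()]` are literal characters, so the first pattern is not reported.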

Alternatively, you could suppress this particular UserWarning by putting

    import warnings
    warnings.filterwarnings("ignore", 'This pattern has match groups')

before the call to str.contains.
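If silencing the warning for the whole program feels too broad, a scoped variant is possible with `warnings.catch_warnings()`. This is only a sketch: the data is illustrative, and because the exact wording of the message may differ between pandas versions, the filter below matches loosely on "This pattern ... match groups". Treat that regex as an assumption and adjust it to the message your version prints.

```python
import warnings

import pandas as pd

df = pd.DataFrame({"event_time": ["gouda", "stilton", "gruyere"]})
pattern = "g(.*)"  # contains a capturing group, so pandas would normally warn

# The filter is active only inside the with-block; the previous
# warning configuration is restored on exit.
with warnings.catch_warnings():
    warnings.filterwarnings(
        "ignore",
        message=r"This pattern .*match groups",  # assumed to cover the message wording
        category=UserWarning,
    )
    mask = df["event_time"].str.contains(pattern, regex=True)

matches = df[mask]
print(matches)
```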


Here is a simple example which demonstrates the problem (and solution):

    # import warnings
    # warnings.filterwarnings("ignore", 'This pattern has match groups')  # uncomment to suppress the UserWarning

    import pandas as pd

    df = pd.DataFrame({'event_time': ['gouda', 'stilton', 'gruyere']})

    urls = pd.DataFrame({'url': ['g(.*)']})   # With a capturing group, there is a UserWarning
    # urls = pd.DataFrame({'url': ['g.*']})   # Without a capturing group, there is no UserWarning. Uncommenting this line avoids the UserWarning.

    substr = urls.url.values.tolist()
    df[df['event_time'].str.contains('|'.join(substr), regex=True)]

prints

    script.py:10: UserWarning: This pattern has match groups. To actually get the groups, use str.extract.
      df[df['event_time'].str.contains('|'.join(substr), regex=True)]

Removing the capturing group from the regex pattern:

urls = pd.DataFrame({'url': ['g.*']})    

avoids the UserWarning.

unutbu answered Sep 20 '22 07:09