Python: UserWarning: This pattern has match groups. To actually get the groups, use str.extract

Tags: python, regex, pandas

I have a DataFrame and I am trying to select the rows where one of the columns contains some string. The DataFrame looks like this:

    member_id,event_path,event_time,event_duration
    30595,"2016-03-30 12:27:33",yandex.ru/,1
    30595,"2016-03-30 12:31:42",yandex.ru/,0
    30595,"2016-03-30 12:31:43",yandex.ru/search/?lr=10738&msid=22901.25826.1459330364.89548&text=%D1%84%D0%B8%D0%BB%D1%8C%D0%BC%D1%8B+%D0%BE%D0%BD%D0%BB%D0%B0%D0%B9%D0%BD&suggest_reqid=168542624144922467267026838391360&csg=3381%2C3938%2C2%2C3%2C1%2C0%2C0,0
    30595,"2016-03-30 12:31:44",yandex.ru/search/?lr=10738&msid=22901.25826.1459330364.89548&text=%D1%84%D0%B8%D0%BB%D1%8C%D0%BC%D1%8B+%D0%BE%D0%BD%D0%BB%D0%B0%D0%B9%D0%BD&suggest_reqid=168542624144922467267026838391360&csg=3381%2C3938%2C2%2C3%2C1%2C0%2C0,0
    30595,"2016-03-30 12:31:45",yandex.ru/search/?lr=10738&msid=22901.25826.1459330364.89548&text=%D1%84%D0%B8%D0%BB%D1%8C%D0%BC%D1%8B+%D0%BE%D0%BD%D0%BB%D0%B0%D0%B9%D0%BD&suggest_reqid=168542624144922467267026838391360&csg=3381%2C3938%2C2%2C3%2C1%2C0%2C0,0
    30595,"2016-03-30 12:31:46",yandex.ru/search/?lr=10738&msid=22901.25826.1459330364.89548&text=%D1%84%D0%B8%D0%BB%D1%8C%D0%BC%D1%8B+%D0%BE%D0%BD%D0%BB%D0%B0%D0%B9%D0%BD&suggest_reqid=168542624144922467267026838391360&csg=3381%2C3938%2C2%2C3%2C1%2C0%2C0,0
    30595,"2016-03-30 12:31:49",kinogo.co/,1
    30595,"2016-03-30 12:32:11",kinogo.co/melodramy/,0

And I have another DataFrame with URL patterns:

    url
    003\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/mobilnyj_telefon_bq_phoenix
    003\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/mobilnyj_telefon_fly_
    003\.ru\/sonyxperia
    003\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/mobilnye_telefony_smartfony
    003\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/mobilnye_telefony_smartfony\/brands5D5Bbr_23
    1click\.ru\/sonyxperia
    1click\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/chasy-motorola

I use

    urls = pd.read_csv('relevant_url1.csv', error_bad_lines=False)
    substr = urls.url.values.tolist()
    data = pd.read_csv('data_nts2.csv', error_bad_lines=False, chunksize=50000)
    result = pd.DataFrame()
    for i, df in enumerate(data):
        res = df[df['event_time'].str.contains('|'.join(substr), regex=True)]

but it gives me this warning:

UserWarning: This pattern has match groups. To actually get the groups, use str.extract.

How can I fix that?

asked Oct 06 '16 16:10

Petr Petrov

2 Answers

An alternative way to get rid of the warning is to change the regex so that it uses a non-capturing group instead of a capturing group. That is the (?:...) notation.

Thus, if the capturing group is (url1|url2), it should be replaced by (?:url1|url2).
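As a rough illustration using only the standard `re` module: a compiled pattern exposes the number of capturing groups it defines as `.groups`, and switching to `(?:...)` brings that count to zero without changing what matches. The sample strings below are made up for the demonstration, and the assumption that pandas bases its warning on this group count is mine.

```python
import re

capturing = r'(url1|url2)'        # alternation wrapped in a capturing group
non_capturing = r'(?:url1|url2)'  # same alternation, non-capturing

# .groups is the number of capturing groups in a compiled pattern.
n_capturing = re.compile(capturing).groups
n_non_capturing = re.compile(non_capturing).groups
print(n_capturing, n_non_capturing)

# Hypothetical sample strings: both patterns match exactly the same ones.
samples = ['example.com/url1', 'example.com/url2', 'example.com/other']
hits_capturing = [bool(re.search(capturing, s)) for s in samples]
hits_non_capturing = [bool(re.search(non_capturing, s)) for s in samples]
print(hits_capturing == hits_non_capturing)
```

Since only the grouping behaviour changes, the rows selected by str.contains should stay the same after the rewrite.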

answered Sep 22 '22 07:09

climatebrad

At least one of the regex patterns in urls must use a capturing group. str.contains only returns True or False for each row in df['event_time'] -- it does not make use of the capturing group. Thus, the UserWarning is alerting you that the regex uses a capturing group but the match is not used.
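To make the difference concrete, here is a small sketch on made-up data (the Series and the pattern are illustrative, not taken from the question's files). str.contains yields only booleans, while str.extract is the method that actually returns what the group captured.

```python
import pandas as pd

s = pd.Series(['gouda', 'stilton', 'gruyere'])
pattern = r'g(.*)'  # one capturing group

# Boolean mask only; the captured text is discarded (this call emits the UserWarning).
mask = s.str.contains(pattern, regex=True)

# str.extract returns the text captured by the group, NaN where nothing matched.
captured = s.str.extract(pattern, expand=False)

print(mask.tolist())
print(captured.tolist())
```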

If you wish to remove the UserWarning you could find and remove the capturing group from the regex pattern(s). They are not shown in the regex patterns you posted, but they must be there in your actual file. Look for parentheses outside of the character classes.
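One way to locate the offending entries without reading every pattern by eye is to compile each one and check how many capturing groups it defines. This is only a sketch: the first two patterns are copied from the question, the third is a hypothetical entry standing in for whatever the real file contains, and `has_capturing_group` is a helper name invented here.

```python
import re

patterns = [
    r'003\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/mobilnyj_telefon_fly_',
    r'003\.ru\/sonyxperia',
    r'(shop|store)\.example\.com\/sonyxperia',  # hypothetical entry with a group
]

def has_capturing_group(pattern):
    """Return True if the compiled pattern defines at least one capturing group."""
    return re.compile(pattern).groups > 0

# Parentheses inside a character class ([...]) are literals, so they are not counted.
offenders = [p for p in patterns if has_capturing_group(p)]
print(offenders)
```

With the real file, `patterns` would be the `substr` list from the question; anything reported in `offenders` is a candidate for rewriting with (?:...) or for removing the parentheses.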

Alternatively, you could suppress this particular UserWarning by putting

    import warnings
    warnings.filterwarnings("ignore", 'This pattern has match groups')

before the call to str.contains.
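If you would rather not silence the warning for the whole program, a scoped variant is possible with the standard `warnings.catch_warnings()` context manager. The sketch below uses made-up data, and it assumes the wording of the message may differ between pandas versions, which is why the filter uses a loose regex rather than the exact text.

```python
import warnings
import pandas as pd

s = pd.Series(['gouda', 'stilton', 'gruyere'])
pattern = r'g(.*)'  # illustrative pattern with a capturing group

with warnings.catch_warnings():
    # The filter only applies inside this block; the message argument is a
    # regex matched against the start of the warning text.
    warnings.filterwarnings(
        'ignore',
        message=r'This pattern.*has match groups',
        category=UserWarning,
    )
    mask = s.str.contains(pattern, regex=True)

print(s[mask].tolist())
```

Outside the `with` block the previous warning filters are restored, so other UserWarnings remain visible.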

Here is a simple example which demonstrates the problem (and solution):

    # import warnings
    # warnings.filterwarnings("ignore", 'This pattern has match groups') # uncomment to suppress the UserWarning

    import pandas as pd

    df = pd.DataFrame({ 'event_time': ['gouda', 'stilton', 'gruyere']})

    urls = pd.DataFrame({'url': ['g(.*)']})   # With a capturing group, there is a UserWarning
    # urls = pd.DataFrame({'url': ['g.*']})   # Without a capturing group, there is no UserWarning. Uncommenting this line avoids the UserWarning.

    substr = urls.url.values.tolist()
    df[df['event_time'].str.contains('|'.join(substr), regex=True)]

prints

    script.py:10: UserWarning: This pattern has match groups. To actually get the groups, use str.extract.
      df[df['event_time'].str.contains('|'.join(substr), regex=True)]

Removing the capturing group from the regex pattern:

urls = pd.DataFrame({'url': ['g.*']})

avoids the UserWarning.

answered Sep 20 '22 07:09

unutbu
