Pandas gives an error from str.extractall('#')

Tags: python, pandas

I am trying to extract all the # keywords (hashtags) from tweet text, using str.extractall(). This is the first time I am filtering keywords from tweetText with pandas. The input, code, expected output and error are given below.

Input:

userID,tweetText 
01, home #sweet home
01, #happy #life 
02, #world peace
03, #all are one
04, world tour

And so on. The full data file is several GB of scraped tweets with several other columns, but I am interested in only these two columns.

Code:

import re
import pandas as pd

data = pd.read_csv('Text.csv', index_col=0, header=None, names=['userID', 'tweetText'])

fout = data['tweetText'].str.extractall('#')

print fout

Expected Output:

userID,tweetText 
01,#sweet
01,#happy 
01,#life 
02,#world
03,#all

Error:

Traceback (most recent call last):
  File "keyword_split.py", line 7, in <module>
    fout = data['tweetText'].str.extractall('#')
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/strings.py", line 1621, in extractall
    return str_extractall(self._orig, pat, flags=flags)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/strings.py", line 694, in str_extractall
    raise ValueError("pattern contains no capture groups")
ValueError: pattern contains no capture groups

Thanks in advance for the help. What is the simplest way to filter keywords by userID?

Output Update:

When I use only the following, the output looks like the one above:

s.name = "tweetText"
data_1 = data[~data['tweetText'].isnull()]

In this case the output contains empty [] entries with their userIDs still listed, and the rows that do have keywords hold an array of keywords rather than a list form.

When I use only the following, the output is what I need, but it contains NaN:

s.name = "tweetText"
data_2 = data_1.drop('tweetText', axis=1).join(s)

The output here is in the correct format, but rows with no keywords are still included and show NaN.

If possible, such userIDs should be dropped and not shown in the output at all. In the next stage I want to calculate keyword frequencies, where NaN or empty [] entries would also be counted, and those counts could compromise the later classification.


asked Jul 24 '16 13:07

Sitz Blogz

3 Answers

Wrap the pattern in parentheses so that it contains a capture group:

fout = data['tweetText'].str.extractall('(#)')

instead of

fout = data['tweetText'].str.extractall('#')

Hope that will work
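Note that '(#)' captures only the '#' character itself. As a minimal sketch, assuming the goal is the whole hashtag, a pattern such as '(#\w+)' captures the '#' together with the word characters that follow it. The sample Series below is made up for illustration and is not the asker's file.

```python
import pandas as pd

# Hypothetical sample tweets (default integer index)
tweets = pd.Series(
    ["home #sweet home", "#happy #life", "#world peace", "world tour"],
    name="tweetText",
)

# '(#)' is a valid capture group, but it captures only the '#' character
only_hash = tweets.str.extractall(r"(#)")

# '(#\w+)' captures '#' plus the word characters after it
hashtags = tweets.str.extractall(r"(#\w+)")

print(only_hash)
print(hashtags)
```

Rows without any match, such as "world tour", simply do not appear in the extractall result.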

answered Oct 20 '22 02:10

Guillaume Ottavianoni

If you are not too tied to using extractall, you can try the following to get your final output:

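from io import StringIO
import re

import pandas as pd


data_text = """userID,tweetText
01, home #sweet home
01, #happy #life 
02, #world peace
03, #all are one
04, world tour
"""

data = pd.read_csv(StringIO(data_text), header=0)

# Collect every hashtag of each tweet into a list (raw string keeps \w intact)
data['tweetText'] = data.tweetText.apply(lambda x: re.findall(r'#(?=\w+)\w+', x))
# Expand each list into one row per hashtag, keeping the original row index
s = data.apply(lambda x: pd.Series(x['tweetText']), axis=1).stack().reset_index(level=1, drop=True)
s.name = "tweetText"
# Join the hashtags back onto userID; tweets without hashtags end up as NaN
data = data.drop('tweetText', axis=1).join(s)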

     userID tweetText
0       1    #sweet
1       1    #happy
1       1     #life
2       2    #world
3       3      #all
4       4       NaN

You can drop the rows where the tweetText column is NaN by doing the following:

data = data[~data['tweetText'].isnull()]

This should return:

   userID tweetText
0       1    #sweet
1       1    #happy
1       1     #life
2       2    #world
3       3      #all
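On newer pandas versions the same result can be sketched without apply and stack. This is an alternative technique, not the code above, and it assumes pandas 0.25 or later, where DataFrame.explode is available. Here str.findall collects the hashtags per tweet, explode puts one hashtag on each row, and dropna discards tweets that had none. The value_counts call illustrates the keyword frequency count mentioned in the question; the names df, tags and counts are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "userID": [1, 1, 2, 3, 4],
    "tweetText": [
        "home #sweet home",
        "#happy #life",
        "#world peace",
        "#all are one",
        "world tour",
    ],
})

tags = (
    df.assign(tweetText=df["tweetText"].str.findall(r"#\w+"))  # list of hashtags per tweet
    .explode("tweetText")            # one row per hashtag; empty lists become NaN
    .dropna(subset=["tweetText"])    # drop tweets without any hashtag
)

# Frequency of each hashtag, with no NaN or empty lists left to distort it
counts = tags["tweetText"].value_counts()

print(tags)
print(counts)
```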

I hope this helps.

answered Oct 20 '22 02:10

Abdou

The extractall function requires a regex pattern with at least one capture group as its first argument, but the pattern you provided, #, contains none.

A possible argument could be (#\S+). The parentheses indicate a capture group, in other words the part of each match that the extractall function should extract from the string.
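To see the difference between these patterns, here is a small sketch using only the standard re module. It assumes nothing about pandas internals beyond the error message shown in the question; the groups attribute of a compiled pattern reports how many capture groups it contains.

```python
import re

patterns = ["#", r"(#)", r"(#\S+)"]

# Pattern.groups is the number of capture groups in a compiled regex
group_counts = {p: re.compile(p).groups for p in patterns}
print(group_counts)

# With a single capture group, findall returns the text captured by that group
found = re.findall(r"(#\S+)", " #happy #life ")
print(found)
```

The bare '#' pattern reports zero groups, which matches the "pattern contains no capture groups" message.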

Example:

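import pandas as pd
from io import StringIO

data = """01, home #sweet home
01, #happy #life 
02, #world peace
03, #all are one
"""

df = pd.read_csv(StringIO(data),
                 header=None,
                 names=['col1', 'col2'],
                 index_col=0)

# Raw string so that \S reaches the regex engine unchanged;
# the parentheses form the capture group that extractall needs
df['col2'].str.extractall(r'(#\S+)')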

The error ValueError: pattern contains no capture groups doesn't appear anymore with the above code (meaning the issue in the question is solved), but this hits a bug in the current version of pandas (I'm using '0.18.1').

The error returned is:

AssertionError: 1 columns passed, passed data had 6 columns

The issue is described here.

If you try df['col2'].str.extractall('#(\S)') (which gives you the first letter of every hashtag), you'll see that the extractall function works as long as the captured group contains only a single character (which matches the issue description). As the issue is closed, it should be fixed in an upcoming pandas release.
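For reference, here is a sketch of how the expected output from the question might be produced on a pandas release where extractall handles multi-character groups. The named group hashtag and the variable names are my own choices for illustration, and the pattern uses \w+ instead of \S+ on the assumption that trailing punctuation should not be part of a hashtag.

```python
from io import StringIO

import pandas as pd

raw = """01, home #sweet home
01, #happy #life
02, #world peace
03, #all are one
04, world tour
"""

# Read userID as text so that the leading zero in "01" is preserved
df = pd.read_csv(StringIO(raw), header=None,
                 names=["userID", "tweetText"], dtype={"userID": str})

hashtags = (
    df.set_index("userID")["tweetText"]
    .str.extractall(r"(?P<hashtag>#\w+)")   # named group becomes the column name
    .reset_index(level="match", drop=True)  # drop the per-tweet match counter
    .reset_index()                          # turn userID back into a column
)

print(hashtags.to_csv(index=False))
```

Because extractall only emits rows for actual matches, userID 04 (no hashtags) does not appear, so no separate NaN filtering step is needed.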

answered Oct 20 '22 02:10

DocZerø
