 

Pandas gives an error from str.extractall('#')

Tags: python, pandas

I am trying to extract all the hashtag keywords (words prefixed with #) from tweet text, using str.extractall(). This is the first time I am filtering keywords from tweetText with pandas. The input, code, expected output and error are given below.

Input:

userID,tweetText 
01, home #sweet home
01, #happy #life 
02, #world peace
03, #all are one
04, world tour

and so on... The full data file is several GB of scraped tweets with many other columns, but I am interested in only these two.

Code:

import re
import pandas as pd

data = pd.read_csv('Text.csv', index_col=0, header=None, names=['userID', 'tweetText'])

fout = data['tweetText'].str.extractall('#')

print(fout)

Expected Output:

userID,tweetText 
01,#sweet
01,#happy 
01,#life 
02,#world
03,#all

Error:

Traceback (most recent call last):
  File "keyword_split.py", line 7, in <module>
    fout = data['tweetText'].str.extractall('#')
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/strings.py", line 1621, in extractall
    return str_extractall(self._orig, pat, flags=flags)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/strings.py", line 694, in str_extractall
    raise ValueError("pattern contains no capture groups")
ValueError: pattern contains no capture groups

Thanks in advance for the help. What is the simplest way to extract the keywords for each userID?

Output Update:

When I use only this:

s.name = "tweetText"
data_1 = data[~data['tweetText'].isnull()]

the output contains an empty [] for rows without keywords, with the userID still listed, and rows that do have keywords hold an array of keywords rather than one keyword per row.

When I use only this, the output is what I need, but with NaN:

s.name = "tweetText"
data_2 = data_1.drop('tweetText', axis=1).join(s)

Here the output has the correct format, but rows with no keywords are still included and show NaN.

If possible, such userIDs should be dropped and not shown in the output at all. In the next stage I will calculate keyword frequencies, where NaN or empty [] would also be counted, and that could compromise the later classification.

Sitz Blogz asked Jul 24 '16 13:07




3 Answers

Wrap the pattern in parentheses so that it contains a capture group:

fout = data['tweetText'].str.extractall('(#)')

instead of

fout = data['tweetText'].str.extractall('#')

Hope that will work
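
One caveat about the pattern above: '(#)' captures only the '#' character itself, so every match comes back as a bare '#'. To get the whole hashtag, the word characters have to sit inside the group as well. Below is a minimal sketch, assuming a pandas release where extractall handles multi-character groups correctly; the Series name `tweets` and the variables `only_hash` and `hashtags` are illustrative and not from the question.

```python
import pandas as pd

# Illustrative data mirroring the question's tweetText column
tweets = pd.Series(
    [" home #sweet home", " #happy #life ", " #world peace",
     " #all are one", " world tour"],
    name="tweetText",
)

# '(#)' has a capture group, so it no longer raises ValueError,
# but the group captures only the '#' character
only_hash = tweets.str.extractall(r"(#)")

# '(#\w+)' captures '#' plus the word characters that follow it
hashtags = tweets.str.extractall(r"(#\w+)")

print(hashtags)
```

Rows without any match (here " world tour") simply do not appear in the result, which is what the question asks for.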

Guillaume Ottavianoni answered Oct 20 '22 02:10


If you are not too tied to using extractall, you can try the following to get your final output:

from io import StringIO
import pandas as pd
import re


data_text = """userID,tweetText
01, home #sweet home
01, #happy #life 
02, #world peace
03, #all are one
04, world tour
"""

data = pd.read_csv(StringIO(data_text),header=0)

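# Collect every hashtag in each tweet into a list
data['tweetText'] = data.tweetText.apply(lambda x: re.findall(r'#\w+', x))
# Expand each list into one row per hashtag, keeping the original row index
s = data.apply(lambda x: pd.Series(x['tweetText']), axis=1).stack().reset_index(level=1, drop=True)
s.name = "tweetText"
# Join the hashtags back onto userID; rows without hashtags become NaN
data = data.drop('tweetText', axis=1).join(s)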

     userID tweetText
0       1    #sweet
1       1    #happy
1       1     #life
2       2    #world
3       3      #all
4       4       NaN

You can drop the rows where the tweetText column is NaN by doing the following:

data = data[~data['tweetText'].isnull()]

This should return:

   userID tweetText
0       1    #sweet
1       1    #happy
1       1     #life
2       2    #world
3       3      #all

I hope this helps.
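
The same result can probably be reached more directly on newer pandas. The sketch below assumes pandas 0.25 or later, where DataFrame.explode is available; it swaps the apply/stack/join steps for str.findall plus explode, and the names `result` and `counts` are illustrative. It also shows one possible way to count keyword frequency per user once the hashtag-free rows are gone.

```python
from io import StringIO

import pandas as pd

data_text = """userID,tweetText
01, home #sweet home
01, #happy #life
02, #world peace
03, #all are one
04, world tour
"""

data = pd.read_csv(StringIO(data_text), header=0)

# One list of hashtags per tweet; tweets without hashtags give an empty list
data["tweetText"] = data["tweetText"].str.findall(r"#\w+")

# explode() puts each list element on its own row; empty lists become NaN,
# and dropna() then removes those rows
result = data.explode("tweetText").dropna(subset=["tweetText"])

# Keyword frequency per user, with hashtag-free tweets already excluded
counts = result.groupby("userID")["tweetText"].value_counts()

print(result)
```

Because the NaN rows are dropped before counting, neither NaN nor empty lists contribute to the frequencies.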

Abdou answered Oct 20 '22 02:10


The extractall function requires a regex pattern with at least one capture group as its first argument. The pattern you provided, #, has none.

A possible argument could be (#\S+). The parentheses indicate a capture group, in other words what the extractall function needs to extract from each string.

Example:

data="""01, home #sweet home
01, #happy #life 
02, #world peace
03, #all are one
"""

import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO(data), 
                 header=None, 
                 names=['col1', 'col2'],
                 index_col=0)

df['col2'].str.extractall(r'(#\S+)')

The error ValueError: pattern contains no capture groups doesn't appear anymore with the above code (meaning the issue in the question is solved), but this hits a bug in the current version of pandas (I'm using '0.18.1').

The error returned is:

AssertionError: 1 columns passed, passed data had 6 columns

The issue is described here.

If you try df['col2'].str.extractall('#(\S)') (which gives you the first letter of every hashtag), you'll see that the extractall function works as long as the captured group contains only a single character (which matches the issue description). As the issue is closed, it should be fixed in an upcoming pandas release.
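
For reference, here is a sketch of how the result could be shaped into the question's expected output, assuming a pandas release in which the issue mentioned above is fixed. The group name `hashtag` and the variables `matches` and `flat` are illustrative choices; unlike the example above, userID is kept as a regular column so that it can be joined back by row label.

```python
from io import StringIO

import pandas as pd

data = """01, home #sweet home
01, #happy #life
02, #world peace
03, #all are one
04, world tour
"""

df = pd.read_csv(StringIO(data), header=None, names=["userID", "tweetText"])

# A named group becomes the column name of the extractall result
matches = df["tweetText"].str.extractall(r"(?P<hashtag>#\S+)")

# The result is indexed by (original row label, match number);
# drop the match level and join userID back on the row label
flat = matches.reset_index(level="match", drop=True).join(df["userID"])

print(flat[["userID", "hashtag"]])
```

Rows with no hashtag never enter `matches`, so no NaN filtering is needed afterwards.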

DocZerø answered Oct 20 '22 02:10