I am trying to extract all the # keywords (hashtags) from the tweet text. I am using str.extractall() to extract all the keywords that start with #.
This is the first time I am working on filtering keywords from the tweetText using pandas. Inputs, code, expected output and error are given below.
Input:
userID,tweetText
01, home #sweet home
01, #happy #life
02, #world peace
03, #all are one
04, world tour
and so on. The full data file is several GB of scraped tweets with several other columns, but I am interested in only these two columns.
Code:
import re
import pandas as pd
data = pd.read_csv('Text.csv', index_col=0, header=None, names=['userID', 'tweetText'])
fout = data['tweetText'].str.extractall('#')
print fout
Expected Output:
userID,tweetText
01,#sweet
01,#happy
01,#life
02,#world
03,#all
Error:
Traceback (most recent call last):
File "keyword_split.py", line 7, in <module>
fout = data['tweetText'].str.extractall('#')
File "/usr/local/lib/python2.7/dist-packages/pandas/core/strings.py", line 1621, in extractall
return str_extractall(self._orig, pat, flags=flags)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/strings.py", line 694, in str_extractall
raise ValueError("pattern contains no capture groups")
ValueError: pattern contains no capture groups
Thanks in advance for the help. What should be the simplest way to filter keywords with respect to userid?
Output Update:
When I use only this:
s.name = "tweetText"
data_1 = data[~data['tweetText'].isnull()]
the output still has empty [] entries, the userIDs without keywords are still listed, and the rows that do have keywords hold an array of keywords rather than one keyword per row.
When I use only this:
s.name = "tweetText"
data_2 = data_1.drop('tweetText', axis=1).join(s)
the output is in the correct format, but the users with no keywords are still included and show NaN.
If possible, such userIDs should be left out of the output entirely. In the next stages I am going to calculate the frequency of the keywords, where NaN or empty [] would also be counted, and that frequency could compromise the later classification.
The extractall() function extracts the capture groups in the regex pat as columns in a DataFrame. For each subject string in the Series, it extracts groups from all matches of the regular expression pat.
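As a rough illustration of that description, the sketch below runs extractall on a small Series. The sample strings and the variable names (tweets, matches) are made up for this example; only str.extractall itself comes from pandas.

```python
import pandas as pd

# Hypothetical sample data, modelled on the tweets in the question.
tweets = pd.Series(["home #sweet home", "#happy #life", "world tour"])

# One capture group gives one column (named 0) and one row per match.
# Strings without any match ("world tour") contribute no rows at all.
matches = tweets.str.extractall(r"(#\w+)")

# The result has a two-level index: the original row label and a
# "match" level that numbers the matches within each string.
print(matches)
```

Because every match becomes its own row, tweets without hashtags simply do not appear in the result.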
Put parentheses (a capture group) in your pattern:
fout = data['tweetText'].str.extractall('(#)')
instead of
fout = data['tweetText'].str.extractall('#')
Hope that will work
If you are not too tied to using extractall, you can try the following to get your final output:
from io import StringIO
import pandas as pd
import re
data_text = """userID,tweetText
01, home #sweet home
01, #happy #life
02, #world peace
03, #all are one
04, world tour
"""
data = pd.read_csv(StringIO(data_text),header=0)
data['tweetText'] = data.tweetText.apply(lambda x: re.findall(r'#\w+', x))
s = data.apply(lambda x: pd.Series(x['tweetText']),axis=1).stack().reset_index(level=1, drop=True)
s.name = "tweetText"
data = data.drop('tweetText', axis=1).join(s)
userID tweetText
0 1 #sweet
1 1 #happy
1 1 #life
2 2 #world
3 3 #all
4 4 NaN
You can drop the rows where the tweetText column is NaN by doing the following:
data = data[~data['tweetText'].isnull()]
This should return:
userID tweetText
0 1 #sweet
1 1 #happy
1 1 #life
2 2 #world
3 3 #all
I hope this helps.
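For comparison, here is a minimal sketch of the same reshaping using str.findall together with DataFrame.explode, assuming a pandas version that provides explode (0.25 or later). The in-memory DataFrame is a hypothetical copy of the sample rows from the question, and the names tags and result are chosen only for this example.

```python
import pandas as pd

# Hypothetical in-memory copy of the sample rows from the question.
data = pd.DataFrame({
    "userID": [1, 1, 2, 3, 4],
    "tweetText": ["home #sweet home", "#happy #life", "#world peace",
                  "#all are one", "world tour"],
})

# findall returns a list of hashtags per tweet (an empty list if none).
tags = data["tweetText"].str.findall(r"#\w+")

# explode produces one row per list element; an empty list becomes NaN,
# which dropna then removes, so userID 4 does not appear in the result.
result = (
    data.assign(tweetText=tags)
        .explode("tweetText")
        .dropna(subset=["tweetText"])
        .reset_index(drop=True)
)
print(result)
```

With the NaN rows gone, something like result["tweetText"].value_counts() could then be used for the keyword frequencies mentioned in the question.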
The extractall function requires a regex pattern with capturing groups as the first argument, but the pattern you provided, #, contains none.
A possible argument could be (#\S+). The parentheses indicate a capture group, in other words what the extractall function needs to extract from each string.
Example:
data="""01, home #sweet home
01, #happy #life
02, #world peace
03, #all are one
"""
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(data),
header=None,
names=['col1', 'col2'],
index_col=0)
df['col2'].str.extractall(r'(#\S+)')
The error ValueError: pattern contains no capture groups doesn't appear anymore with the above code (meaning the issue in the question is solved), but this hits a bug in the current version of pandas (I'm using '0.18.1').
The error returned is:
AssertionError: 1 columns passed, passed data had 6 columns
The issue is described here.
If you try df['col2'].str.extractall(r'#(\S)') (which gives you the first letter of every hashtag), you'll see that the extractall function works as long as the captured group contains only a single character (which matches the issue description). As the issue is closed, it should be fixed in an upcoming pandas release.