
How do I extract hashtags from tweets in R?

Tags: regex, r, tweets

I know this question has been asked here and here, but I ran into a small problem when I tried it out:

x<- str_extract("Hello peopllz! My new home is #crazy gr8! #wow", "#\S+")
Error: '\S' is an unrecognized escape in character string starting "#\S"

I changed the regex to "#(.+) ?" and then to "#\\s", but neither of them extracted the hashtags.

I then tried the gsub way:

x<- gsub("[^#(.+) ?]","","Hello! #London is gr8. #Wow")

It gave: " # . #"

Any ideas where I am going wrong? I'd like my output to be a vector/list of all the hashtags in the tweet (without the hashes!).

Edit: I would prefer not to tokenize the tweet, because: 1. I am not tokenizing the tweets for the rest of my program, and 2. it would become a very expensive step if I were to scale it up to handle large volumes of tweets.

asked Dec 07 '12 12:12 by jackStinger



2 Answers

Use "#\\S+" instead of "#\S+".

str_extract_all("Hello peopllz! My new home is #crazy gr8! #wow", "#\\S+")
# [[1]]
# [1] "#crazy" "#wow"  

There are two levels of parsing going on here. Before the low-level regexp function inside str_extract gets the pattern you want to search for (i.e. "#\S+"), the string literal is first parsed by R. R does not recognize \S as a valid escape sequence and throws an error. By escaping the backslash as \\ you tell R to pass \ and S to the regexp function as two ordinary characters, instead of trying to interpret them as a single escape sequence.
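The two parsing levels can be checked directly by printing the pattern. The sketch below uses base R (regmatches and gregexpr) rather than stringr, so that it runs without extra packages. It also strips the leading hash, which the question asked for. The variable names are only illustrative.

```r
pattern <- "#\\S+"

# What you type vs. what the regexp engine receives
nchar(pattern)       # 4 characters: #, \, S, +
writeLines(pattern)  # prints #\S+

tweet <- "Hello peopllz! My new home is #crazy gr8! #wow"

# Find every match, then drop the leading hash from each one
tags <- regmatches(tweet, gregexpr(pattern, tweet, perl = TRUE))[[1]]
tags <- sub("^#", "", tags)
tags
# [1] "crazy" "wow"
```

Note that \S+ takes everything up to the next whitespace, so a tag followed directly by punctuation (e.g. "#London,") would keep the comma. If that matters for your data, "#\\w+" may be a tighter pattern.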

Side track

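This can produce rather bizarre expressions. Imagine that you have a list of addresses of computers on a Windows network of the form "\\computer". Each literal backslash has to be escaped once for the regexp engine and once more for R, so to match both leading backslashes you would need to type str_extract(adr, "\\\\\\\\\\w+"), which R turns into \\\\\w+ before handing it to the regexp engine. The shorter "\\\\\\w+" reaches the engine as \\\w+ and matches only a single backslash followed by the name.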
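With this many backslashes it is easy to miscount, so it can help to print the pattern and try it on a sample value. A small base R check follows; adr is a made-up sample address, and the last part assumes R 4.0.0 or later, where raw strings are available.

```r
adr <- "\\\\computer"   # R literal for the 10-character string \\computer
writeLines(adr)         # prints \\computer

one_slash <- "\\\\\\w+"       # engine sees \\\w+   : one literal backslash, then word chars
two_slash <- "\\\\\\\\\\w+"   # engine sees \\\\\w+ : two literal backslashes, then word chars

m1 <- regmatches(adr, regexpr(one_slash, adr, perl = TRUE))
m2 <- regmatches(adr, regexpr(two_slash, adr, perl = TRUE))
writeLines(m1)  # prints \computer
writeLines(m2)  # prints \\computer

# Raw strings (R >= 4.0.0) skip R's own escaping, so only the
# regexp-level escapes remain
raw_pat <- r"(\\\\\w+)"
identical(raw_pat, two_slash)
```

The raw-string form leaves only the escaping that the regexp engine itself needs, which tends to be easier to read and count.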

answered Sep 19 '22 12:09 by Backlin


Just chiming in. Depending on how you access the Twitter data, this information may already be parsed for you. For example, if you access the sample stream, the raw JSON has an entry that lists the references, tags, etc., as arrays. See the Twitter API documentation here.
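As a rough illustration, here is how pre-parsed hashtags could be read from such a payload. This is a sketch under assumptions: the JSON below is a hand-written sample shaped like the entities.hashtags field of a v1.1-style tweet object, not real API output, and it relies on the jsonlite package being installed.

```r
library(jsonlite)

# Hand-written sample shaped like a tweet object (assumption, not real API output)
tweet_json <- '{
  "text": "Hello! #London is gr8. #Wow",
  "entities": {
    "hashtags": [
      {"text": "London", "indices": [7, 14]},
      {"text": "Wow", "indices": [23, 27]}
    ]
  }
}'

tweet <- fromJSON(tweet_json)

# fromJSON turns the array of hashtag objects into a data frame;
# the text column already holds the tags without the leading hash
hashtags <- tweet$entities$hashtags$text
hashtags
# [1] "London" "Wow"
```

If your data source delivers this structure, no regular expression is needed at all.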

answered Sep 20 '22 12:09 by Btibert3