I want to remove hash tags from tweets using R's regular expressions (I'd like to keep this in base R, but other solutions are welcome for the benefit of future searchers).
I have a regex I thought would remove hash tags, but I hit a corner case when there's a # in a URL, as demonstrated in the MWE below. How can I remove hash tags in text but keep the # in a URL?
Here is an MWE and the code I've tried:
text.var <- c("Slides from great talk: @ramnath_vaidya: Interactive slides from Interactive Visualization",
"presentation #user2014. http://ramnathv.github.io/user2014-rcharts/#1")
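# Both attempts remove #user2014 but also strip the "#1" fragment from the URL
gsub("#\\w+", "", text.var)
# This one additionally eats the "." after #user2014, since \\S matches punctuation
gsub("#\\S+", "", text.var)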
The desired output is:
[1] "Slides from great talk: @ramnath_vaidya: Interactive slides from Interactive Visualization"
[2] "presentation . http://ramnathv.github.io/user2014-rcharts/#1"
Note: R's regular expressions are similar to those of other languages but have R-specific behavior. This question is about R's regex, not regex in general.
For this specific case you can use a negative lookbehind assertion, which matches a # only when it is not immediately preceded by a /.
gsub('(?<!/)#\\w+', '', text.var, perl = TRUE)
# [1] "Slides from great talk: @ramnath_vaidya: Interactive slides from Interactive Visualization"
# [2] "presentation . http://ramnathv.github.io/user2014-rcharts/#1"
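The lookbehind only inspects the single character before the #, so a URL fragment that does not directly follow a slash would still be stripped. A minimal sketch of that limitation, using a made-up input string and one possible variant (an assumption, not part of the original answer) that requires the # to sit at the start of the string or after whitespace:

```r
# Hypothetical input: the URL fragment "#intro" is preceded by "e", not "/"
x <- "see http://example.com/page#intro and #rstats"

# Lookbehind on "/" only: the URL fragment is removed along with the hash tag
slash_only <- gsub("(?<!/)#\\w+", "", x, perl = TRUE)

# Assumed variant: reject a "#" that follows any non-whitespace character
ws_bound <- gsub("(?<!\\S)#\\w+", "", x, perl = TRUE)

print(slash_only)
print(ws_bound)
```

The whitespace variant keeps the URL intact here, but it would also leave alone a hash tag glued to punctuation such as "(#tag)", so which boundary to use depends on the data.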
Or you can use some dark magic that PCRE offers:
gsub('http://\\S+(*SKIP)(*F)|#\\w+', '', text.var, perl = TRUE)
# [1] "Slides from great talk: @ramnath_vaidya: Interactive slides from Interactive Visualization"
# [2] "presentation . http://ramnathv.github.io/user2014-rcharts/#1"
The idea here is to skip any URL that starts with http://, which you can tweak if you need to.
On the left side of the alternation operator we match a URL and then force the subpattern to fail with (*F). The (*SKIP) backtracking control verb stops the regular expression engine from retrying the substring it just consumed, so the search resumes at the position after the URL. The right side of the alternation operator matches what we want to remove, the hash tag.
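As one example of such a tweak, the URL prefix can be loosened so that https:// links are skipped too. This is a sketch under the assumption that every URL starts with http:// or https:// and ends at the next whitespace; the first input string is made up for illustration:

```r
# Assumed tweak: "https?" covers both http:// and https:// URLs
skip_urls <- "https?://\\S+(*SKIP)(*F)|#\\w+"

# Hypothetical inputs: one https URL with a fragment, plus the tweet text from the question
x <- c("https://example.com/a#b is tagged #rstats",
       "presentation #user2014. http://ramnathv.github.io/user2014-rcharts/#1")

# Hash tags outside URLs are removed; "#" inside the skipped URLs is untouched
cleaned <- gsub(skip_urls, "", x, perl = TRUE)
print(cleaned)
```

URLs written without a scheme (for example "example.com/#1") are not covered by this pattern and would need their own alternative on the left side.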