
hash tags regex, keep # in url

Tags: regex, r

I want to remove hash tags from tweets using R's regular expressions (I'd like to keep this in base R, but other solutions are welcome for robustness of the answer for future searchers).

I have a regex that I thought would remove hash tags, but it fails on a corner case: a # inside a URL, as shown in the MWE below. How can I remove hash tags from the text but keep the # in a URL?

Here is a MWE and the code I've tried:

text.var <- c("Slides from great talk: @ramnath_vaidya: Interactive slides from Interactive Visualization", 
    "presentation #user2014. http://ramnathv.github.io/user2014-rcharts/#1")

# Both attempts remove "#user2014" but also strip the "#1" fragment from the URL
gsub("#\\w+", "", text.var)
gsub("#\\S+", "", text.var)

The desired output is:

[1] "Slides from great talk: @ramnath_vaidya: Interactive slides from Interactive Visualization"
[2] "presentation . http://ramnathv.github.io/user2014-rcharts/#1"

Note that R's regular expressions are similar to other regex flavors but have their own specifics. This question is about R's regex, not regex in general.

Tyler Rinker asked Sep 12 '25 07:09


1 Answer

Well, for this specific case you can use a negative lookbehind assertion, which matches a # only when it is not immediately preceded by a /.

gsub('(?<!/)#\\w+', '', text.var, perl = TRUE)
# [1] "Slides from great talk: @ramnath_vaidya: Interactive slides from Interactive Visualization"
# [2] "presentation . http://ramnathv.github.io/user2014-rcharts/#1"
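The "specific case" caveat matters: the lookbehind only protects a # that directly follows a /. A minimal sketch of where it falls short, using a made-up input string (the URL and the variable names are assumptions for illustration, not from the question):

```r
# Hypothetical input: the "#" in this URL follows a letter, not a "/".
x <- "Notes #rstats http://example.com/page#section"

# The lookbehind only rejects a "#" that comes directly after "/",
# so the URL fragment "#section" is removed along with the hash tag.
res_lookbehind <- gsub("(?<!/)#\\w+", "", x, perl = TRUE)
print(res_lookbehind)
```

Here the fragment after "page" is stripped too, so the lookbehind is only safe when every URL fragment sits right after a slash.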

Or you can use some dark magic that PCRE offers:

gsub('http://\\S+(*SKIP)(*F)|#\\w+', '', text.var, perl = TRUE)
# [1] "Slides from great talk: @ramnath_vaidya: Interactive slides from Interactive Visualization"
# [2] "presentation . http://ramnathv.github.io/user2014-rcharts/#1"

The idea here is to skip any URL that starts with http://, which you can tweak if you need to (for example, to also cover https://).

On the left side of the alternation operator, we match a URL and then force the subpattern to fail with (*F). The backtracking control verb (*SKIP) tells the regular expression engine not to retry from any position inside the substring it just matched, so the search resumes at the position right after the URL. The right side of the alternation operator matches what we want: the hash tag.
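As one possible tweak, the pattern can be wrapped in a small function and the URL part widened to `https?://`. This is a sketch, not a definitive solution: `strip_hashtags` is a hypothetical helper name, and the second input string is made up for illustration.

```r
# Hypothetical helper; the pattern is the one above with "https?" so that
# both http:// and https:// URLs are skipped.
strip_hashtags <- function(x) {
  gsub("https?://\\S+(*SKIP)(*F)|#\\w+", "", x, perl = TRUE)
}

tweets <- c(
  "presentation #user2014. http://ramnathv.github.io/user2014-rcharts/#1",
  "Notes #rstats https://example.com/page#section"  # made-up example
)

# Hash tags in the text are removed; any "#" inside a URL is left alone,
# regardless of which character precedes it.
res_skip <- strip_hashtags(tweets)
print(res_skip)
```

Because the whole URL is consumed before the match is discarded, this version does not depend on the # following a slash. Note that `\\S+` treats everything up to the next whitespace as part of the URL, including trailing punctuation.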

hwnd answered Sep 13 '25 23:09