How can I use regex in R to extract Twitter usernames from a string of text?
I've tried
library(stringr)
theString <- '@foobar Foobar! and @foo (@bar) but not [email protected]'
str_extract_all(string=theString,pattern='(?:^|(?:[^-a-zA-Z0-9_]))@([A-Za-z]+[A-Za-z0-9_]+)')
But I end up with @foobar, @foo and (@bar, which contains an unwanted parenthesis.
How can I get just @foobar, @foo and @bar as output?
Here's one method that works in R:
theString <- '@foobar Foobar! and @foo (@bar) but not [email protected]'
theString1 <- unlist(strsplit(theString, " "))
regex <- "(^|[^@\\w])@(\\w{1,15})\\b"
idx <- grep(regex, theString1, perl = TRUE)
theString1[idx]
[1] "@foobar" "@foo" "(@bar)"
If you want to use @Jerry's answer in R:
regex <- "@([A-Za-z]+[A-Za-z0-9_]+)(?![A-Za-z0-9_]*\\.)"
idx <- grep(regex, theString1, perl = TRUE)
theString1[idx]
[1] "@foobar" "@foo" "(@bar)"
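Both of these methods still include the parentheses that you don't want, however, because grep only tells you which space-separated tokens contain a match and those tokens are returned whole.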
UPDATE: This will get you from start to finish with no parentheses or any other kind of punctuation (except underscores, since they're allowed in usernames):
theString <- '@foobar Foobar! and @fo_o (@bar) but not [email protected]'
theString1 <- unlist(strsplit(theString, " "))
regex1 <- "(^|[^@\\w])@(\\w{1,15})\\b" # get strings with @
regex2 <- "[^[:alnum:]@_]" # remove all punctuation except _ and @
users <- gsub(regex2, "", theString1[grep(regex1, theString1, perl = TRUE)])
users
[1] "@foobar" "@fo_o" "@bar"
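As an alternative to splitting and then cleaning up, the extraction can probably be done in one step with a lookbehind, so the leading character never becomes part of the match. This is only a sketch: it uses base R's regmatches() and gregexpr() with perl = TRUE, and the sample string ends in user@example.com, a made-up address used here purely for illustration.

```r
# Sample text; "user@example.com" is a hypothetical address, not from the question
theString <- "@foobar Foobar! and @fo_o (@bar) but not user@example.com"

# (?<![A-Za-z0-9_@])  the @ must not directly follow a word character or another @
# @\w{1,15}\b         an @ plus 1 to 15 word characters, ending at a word boundary
pattern <- "(?<![A-Za-z0-9_@])@\\w{1,15}\\b"

# gregexpr() finds every match position; regmatches() pulls out the matched text
users <- regmatches(theString, gregexpr(pattern, theString, perl = TRUE))[[1]]
users
```

Because the lookbehind is zero-width, the "(" before @bar is checked but not captured, and the @ inside the email address is skipped since it follows a letter. The same pattern should also work with stringr's str_extract_all(), although I have only sketched the base R version here.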
@[a-zA-Z0-9_]{0,15}
Where:
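@ matches the character @ literally.
[a-zA-Z0-9_] matches a single character present in the list (a letter, a digit or an underscore).
{0,15} is a quantifier that matches between 0 and 15 times, as many times as possible, giving back as needed. With a lower bound of 0 a bare @ also matches; use {1,15} to require at least one character after the @.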
It works fine for selecting Twitter usernames from a mixed dataset.
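A minimal sketch of applying this pattern to a character vector in base R follows. The tweets vector and its contents are made up for illustration, and the quantifier is written as {1,15} rather than {0,15} so that a bare @ is skipped.

```r
# Hypothetical mixed input: mentions, plain text and an email address
tweets <- c("thanks @foo and @bar_1!",
            "no mention here",
            "mail me at user@example.com")

# {1,15} instead of {0,15}: require at least one character after the @
pattern <- "@[a-zA-Z0-9_]{1,15}"

# One list element per input string, holding every match found in it
mentions <- regmatches(tweets, gregexpr(pattern, tweets))
mentions
```

Note that this pattern has no check on what comes before the @, so the third element comes back as "@example" from the email address. If that matters for your data, add a guard on the preceding character.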