I'm trying to extract Twitter handles from tweets using R's stringr package. For example, suppose I want to get all words in a vector that begin with "A". I can do this like so:
library(stringr)
# Get all words that begin with "A"
str_extract_all(c("hAi", "hi Ahello Ame"), "(?<=\\b)A[^\\s]+")
[[1]]
character(0)
[[2]]
[1] "Ahello" "Ame"
Great. Now let's try the same thing using "@" instead of "A":
str_extract_all(c("h@i", "hi @hello @me"), "(?<=\\b)\\@[^\\s]+")
[[1]]
[1] "@i"
[[2]]
character(0)
Why does this example give the opposite of the result I was expecting, and how can I fix it?
To extract words from a string, we can use the word() function from the stringr package. For example, if x is a string containing 100 words separated by single spaces, the first 20 words can be extracted with word(x, start = 1, end = 20, sep = fixed(" ")).
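As a small illustration of word(), the sketch below uses a made-up five-word sentence instead of a 100-word string. The variable names are assumptions chosen for the example.

```r
library(stringr)

# Hypothetical example string: five words separated by single spaces
sentence <- "the quick brown fox jumps"

# Extract words 1 through 3 as a single string
first_three <- word(sentence, start = 1, end = 3, sep = fixed(" "))
print(first_three)
# [1] "the quick brown"
```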
The substring() function in R extracts substrings from a character vector, given the first and last character positions to keep.
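A minimal sketch of substring() on a made-up string follows. The string and variable names are illustrative assumptions.

```r
# Hypothetical example string
s <- "hello world"

# Characters 1 through 5
head_part <- substring(s, 1, 5)
print(head_part)
# [1] "hello"

# substring() is vectorised over the start and end positions
parts <- substring(s, c(1, 7), c(5, 11))
print(parts)
# [1] "hello" "world"
```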
A 'regular expression' is a pattern that describes a set of strings. Base R uses two types of regular expressions: extended regular expressions (the default) and Perl-like regular expressions, enabled with perl = TRUE.
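To make the distinction concrete, here is a small sketch in base R. The first call uses the default extended syntax, and the second uses a lookbehind, which is a Perl-like feature and therefore needs perl = TRUE. The input strings are made up for illustration.

```r
# Default (extended) regular expression: one or more "a" followed by "b"
is_match <- grepl("^a+b$", "aab")
print(is_match)
# [1] TRUE

# Perl-like regular expression: (?<=a) is a lookbehind, so perl = TRUE is required.
# Replace every "b" that is immediately preceded by "a".
replaced <- gsub("(?<=a)b", "X", "abab", perl = TRUE)
print(replaced)
# [1] "aXaX"
```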
It looks like you probably mean:
str_extract_all(c("h@i", "hi @hello @me", "@twitter"), "(?<=^|\\s)@[^\\s]+")
# [[1]]
# character(0)
# [[2]]
# [1] "@hello" "@me"
# [[3]]
# [1] "@twitter"
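The \b in a regular expression is a word boundary: it occurs "Between two characters in the string, where one is a word character and the other is not a word character." Since a space and "@" are both non-word characters, there is no boundary before the "@" in "hi @hello". In "h@i", by contrast, "h" is a word character and "@" is not, so there is a boundary before the "@", which is why "@i" was matched.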
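The boundary behaviour can be checked directly. This is a minimal sketch using str_detect() on made-up inputs; it only tests whether a word boundary sits immediately before an "@".

```r
library(stringr)

x <- c("h@i", "hi @hello", "@me")

# \b before "@" matches only when the preceding character is a word character
has_boundary <- str_detect(x, "\\b@")
print(has_boundary)
# [1]  TRUE FALSE FALSE
```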
With this revision, the lookbehind (?<=^|\s) matches an "@" only at the start of the string or immediately after a whitespace character.
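If the handles are needed without the leading "@", the revised pattern could be wrapped in a small helper. This is only a sketch: extract_handles is a hypothetical name, not part of stringr, and the inputs are the same toy strings used above.

```r
library(stringr)

# Hypothetical helper: returns the handles found in each tweet,
# with the leading "@" removed
extract_handles <- function(tweets) {
  handles <- str_extract_all(tweets, "(?<=^|\\s)@[^\\s]+")
  lapply(handles, str_remove, pattern = "^@")
}

result <- extract_handles(c("h@i", "hi @hello @me", "@twitter"))
print(result)
```

Note that [^\\s]+ takes everything up to the next whitespace, so trailing punctuation such as the comma in "@me," would be kept as part of the handle.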