
R regex - extract words beginning with @ symbol

Tags: regex, r, stringr

I'm trying to extract Twitter handles from tweets using R's stringr package. For example, suppose I want to get all words in a vector that begin with "A". I can do this like so:

library(stringr)

# Get all words that begin with "A"
str_extract_all(c("hAi", "hi Ahello Ame"), "(?<=\\b)A[^\\s]+")

[[1]]
character(0)

[[2]]
[1] "Ahello" "Ame"   

Great. Now let's try the same thing using "@" instead of "A":

str_extract_all(c("h@i", "hi @hello @me"), "(?<=\\b)\\@[^\\s]+")

[[1]]
[1] "@i"

[[2]]
character(0)

Why does this example give the opposite of the result I was expecting, and how can I fix it?

Asked by Ben on Mar 14 '19 at 20:03




1 Answer

It looks like you probably mean

str_extract_all(c("h@i", "hi @hello @me", "@twitter"), "(?<=^|\\s)@[^\\s]+")
# [[1]]
# character(0)
# [[2]]
# [1] "@hello" "@me" 
# [[3]]
# [1] "@twitter"

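The \b in a regular expression is a word boundary: it occurs "Between two characters in the string, where one is a word character and the other is not a word character." Since a space and "@" are both non-word characters, there is no boundary between them, so \b cannot match before "@hello" or "@me". In "h@i", by contrast, "h" is a word character, so there is a boundary immediately before the "@", which is why "@i" was extracted.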
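A minimal sketch of that boundary behaviour, assuming stringr's default regex engine (ICU) with no special flags. The variable name `has_boundary_at` is made up for illustration; the pattern `\\b@` only asks whether a word boundary sits directly before an "@".

```r
library(stringr)

# "\\b@" can only match when a word character comes right before the "@".
has_boundary_at <- str_detect(c("h@i", "hi @hello", "@me"), "\\b@")
has_boundary_at
# [1]  TRUE FALSE FALSE
```

Only "h@i" has a word character ("h") next to the "@". In the other two strings the "@" follows a space or the start of the string, so no boundary exists there.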

With this revision, the pattern matches an "@" only at the start of the string or immediately after a whitespace character.
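An alternative way to express the same idea, offered as a sketch rather than as part of the answer above, is a negative lookbehind: `(?<!\\S)` means "not preceded by a non-whitespace character", which covers both the start of the string and a preceding space. The helper name `extract_handles` is hypothetical, and using `\\w+` after the "@" is an assumption that handles contain only letters, digits, and underscores, so trailing punctuation is left out.

```r
library(stringr)

# Hypothetical helper: pull out "@name" tokens that start a string
# or follow whitespace. "\\w+" (an assumption) stops at punctuation.
extract_handles <- function(x) {
  str_extract_all(x, "(?<!\\S)@\\w+")
}

handles <- extract_handles(c("h@i", "hi @hello, @me!", "@twitter"))
handles
# [[1]]
# character(0)
# [[2]]
# [1] "@hello" "@me"
# [[3]]
# [1] "@twitter"
```

The original `[^\\s]+` would instead return "@hello," and "@me!" for the second string, so which pattern is preferable depends on whether trailing punctuation should be kept.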

Answered by MrFlick on Sep 27 '22 at 22:09