
Extract multiple substrings from a single

Tags:

text

r

I want to extract the tags (twitter handles) from tweets.

tweet <- "@me bla bla bla bla @him some text @her"

With:

at <- regexpr('@[[:alnum:]]*', tweet)
handle <- substr(tweet,at+1,at+attr(at,"match.length")-1)

I successfully extract the first handle:

handle
[1] "me"

However, I am unable to find a way to extract the others. Does anyone know a way to do this? Thanks.

JohnCoene asked Aug 28 '14 07:08


2 Answers

library(stringr)
# perl() was deprecated in stringr 1.0.0; a plain pattern string supports the lookbehind
str_extract_all(tweet, "(?<=@)\\w+")[[1]]
#[1] "me"  "him" "her"

Or use stringi for faster processing:

 library(stringi)
 stri_extract_all_regex(tweet, "(?<=@)\\w+")[[1]]
 #[1] "me"  "him" "her"

Benchmarks

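 tweet1 <- rep(tweet, 1e5)
 # f1: base R, match the handle including "@", then strip the "@"
 f1 <- function() {
   m <- regmatches(tweet1, gregexpr("@[a-z]+", tweet1))[[1]]
   substring(m, 2)
 }
 # f2: stringi with a lookbehind
 f2 <- function() {stri_extract_all_regex(tweet1, "(?<=@)\\w+")[[1]]}
 # f3: base R with a PCRE lookbehind (returns the full list, not only element 1)
 f3 <- function() {regmatches(tweet1, gregexpr("(?<=@)[a-z]+", tweet1, perl=TRUE))}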

 library(microbenchmark)
 microbenchmark(f1(), f2(), f3(), unit="relative")
 #Unit: relative
 # expr      min       lq   median       uq      max neval
 #f1() 5.387274 5.253141 5.143694 5.166854 4.544567   100
 #f2() 1.000000 1.000000 1.000000 1.000000 1.000000   100
 #f3() 5.523090 5.440423 5.301971 5.335775 4.721337   100
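
Before comparing timings, it can help to confirm that the base R variants return the same handles. The following is a minimal sketch rather than part of the original benchmark: it uses a smaller replication count, and the names `a` and `b` are introduced here for illustration only.

```r
tweet <- "@me bla bla bla bla @him some text @her"
tweet1 <- rep(tweet, 1e3)  # smaller than the benchmark's 1e5, for a quick check

# Variant 1: match "@handle", then drop the leading "@" from every match
a <- lapply(regmatches(tweet1, gregexpr("@[a-z]+", tweet1)), substring, 2)

# Variant 2: a PCRE lookbehind keeps the "@" out of the match
b <- regmatches(tweet1, gregexpr("(?<=@)[a-z]+", tweet1, perl = TRUE))

identical(a, b)
# [1] TRUE
```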
akrun answered Nov 08 '22 23:11


I would suggest:

tweet <- "@me bla bla bla bla @him some text @her"
regmatches(tweet, gregexpr("(?<=@)[a-z]+", tweet, perl=TRUE))

## [[1]]
## [1] "me"  "him" "her"
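
Note that `[a-z]+` only matches lowercase letters, so a handle containing uppercase letters, digits or underscores would be cut short or missed. A minimal sketch of a broader variant follows, assuming handles consist of word characters (`\\w`); the `tweets` vector and the `extract_handles` helper are hypothetical names made up for this example.

```r
# Hypothetical example tweets, not from the original question
tweets <- c("@me bla bla @Him_42 some text @her",
            "no handles in this one")

# \\w matches letters, digits and underscore; perl = TRUE enables the lookbehind
extract_handles <- function(x) {
  regmatches(x, gregexpr("(?<=@)\\w+", x, perl = TRUE))
}

handles <- extract_handles(tweets)
handles[[1]]
# [1] "me"     "Him_42" "her"
handles[[2]]
# character(0)
```

The result is a list with one element per tweet, so a tweet without any handle yields an empty character vector instead of an error.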
DJJ answered Nov 09 '22 00:11