
Forced to use mapply: is there a workaround?

Tags:

r

I have a data.frame with a single column, "Term". Each entry is a string of two or more words, with no upper limit.

From this column "Term", I would like to extract the last word and store it in a new column "Last".

# load library
library(dplyr)
library(stringi)

# read csv
df <- read.csv("filename.txt", stringsAsFactors = FALSE)

# show df
head(df)

#              Term
# 1 this is for the
# 2   thank you for
# 3   the following
# 4   the fact that
# 5       the first 

I have written a function, LastWord, which works well when a single string is given.
However, when a vector of strings is given, it only processes the first string in the vector. This has forced me to use mapply together with mutate to add the column, as shown below.

LastWord <- function(InputWord) {
    stri_sub(InputWord,stri_locate_last(str=InputWord, fixed=" ")[1,1]+1, stri_length(InputWord))
}

df <- mutate(df, Last=mapply(LastWord, df$Term))

Using mapply makes the process very slow. I generally need to process around 10 to 15 million lines or terms at a time. It takes hours.

Could anyone suggest a way to write the LastWord function so that it works on a vector rather than a single string?

Cyrus Lentin asked Dec 08 '22 03:12


1 Answer

You can try:

df$LastWord <- gsub(".* ([^ ]+)$", "\\1", df$Term)
df
#              Term  LastWord
# 1 this is for the       the
# 2   thank you for       for
# 3   the following following
# 4   the fact that      that
# 5       the first     first

In the gsub call, the expression in parentheses, [^ ]+, matches one or more non-space characters at the end of the string ($). If the words contain only letters, [a-zA-Z]+ could work too. The leading ".* " consumes everything up to and including the last space. Because [^ ]+ is enclosed in parentheses, it is captured and can be referred to as \\1 in the replacement, so gsub replaces the whole string with just the captured last word.
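To make the behaviour of the pattern concrete, here is a small sketch in base R. The sample vector is hypothetical and only mirrors the rows shown in the question. It also illustrates two edge cases that may matter on real data: a trailing space stops the $ anchor from matching, so the input is trimmed first, and a string without any space does not match at all and comes back unchanged.

```r
# Hypothetical sample data mirroring the question's Term column
terms <- c("this is for the", "thank you for", "the following",
           "the fact that", "the first ")   # note the trailing space

# trimws() strips leading/trailing whitespace so the $ anchor can match
last <- sub(".* ([^ ]+)$", "\\1", trimws(terms))
print(last)

# A string with no space does not match the pattern and is returned as-is
single <- sub(".* ([^ ]+)$", "\\1", "hello")
print(single)
```

Both calls are vectorised over their input, so no mapply is needed.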

EDIT:
As @akrun mentioned in the comments, sub can be used instead of gsub in this case, since the pattern is anchored at the end of the string and can match at most once per element.
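Since the question already loads stringi, the helper itself could also be made vector-aware. The original LastWord indexes the location matrix with [1,1], which picks only the first row; taking the whole first column instead handles every element at once. The following is a minimal sketch, assuming the stringi package is installed; last_word is a hypothetical name and the three-row data.frame is made up for illustration.

```r
library(stringi)

# Hypothetical vectorised variant of the question's LastWord()
last_word <- function(x) {
  # One row per input string; column 1 holds the position of the last space
  last_space <- stri_locate_last_fixed(x, " ")[, 1]
  # Keep everything after that space (to = -1 means "through the end")
  stri_sub(x, from = last_space + 1L, to = -1L)
}

df <- data.frame(Term = c("this is for the", "thank you for", "the following"),
                 stringsAsFactors = FALSE)
df$Last <- last_word(df$Term)
print(df)
```

With dplyr this would be called as mutate(df, Last = last_word(Term)). Unlike the regex approach, terms without any space are expected to yield NA here rather than the original string, so it is worth checking which behaviour is wanted.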

Cath answered Dec 10 '22 17:12