I have a data.frame with a single column, "Term". Each entry is a string of two or more words, with no upper limit.
From this column, I would like to extract the last word and store it in a new column, "Last".
# load library
library(dplyr)
library(stringi)
# read csv
df <- read.csv("filename.txt", stringsAsFactors = FALSE)
# show df
head(df)
# Term
# 1 this is for the
# 2 thank you for
# 3 the following
# 4 the fact that
# 5 the first
I have written a function, LastWord, which works well when a single string is given. However, when a vector of strings is given, it only processes the first string in the vector. This has forced me to use mapply with mutate to add the column, as seen below.
LastWord <- function(InputWord) {
stri_sub(InputWord,stri_locate_last(str=InputWord, fixed=" ")[1,1]+1, stri_length(InputWord))
}
df <- mutate(df, Last=mapply(LastWord, df$Term))
Using mapply makes the process very slow. I generally need to process around 10 to 15 million lines or terms at a time, and it takes hours.
Could anyone suggest a way to write the LastWord function so that it works on a vector rather than on a single string?
You can try:
df$LastWord <- gsub(".* ([^ ]+)$", "\\1", df$Term)
df
#              Term  LastWord
# 1 this is for the       the
# 2   thank you for       for
# 3   the following following
# 4   the fact that      that
# 5       the first     first
In the gsub call, the expression between the parentheses, [^ ]+, matches one or more characters that are not a space (here, [a-zA-Z]+ could work too), anchored at the end of the string by $. The leading ".* " greedily consumes everything up to and including the last space. Because [^ ]+ is inside parentheses, it is captured and can be referenced as \\1 in the replacement, so gsub replaces the whole string with just the captured last word.
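To make the behavior of the pattern concrete, here is a minimal sketch on a small character vector. The vector name terms is made up for illustration; the strings are taken from the sample data in the question. It uses sub rather than gsub, which gives the same result here because the anchored pattern can match at most once per string.

```r
# Hypothetical sample vector, copied from the rows shown in the question
terms <- c("this is for the", "thank you for", "the following")

# ".* " greedily matches up to the last space; ([^ ]+)$ captures the final word
last <- sub(".* ([^ ]+)$", "\\1", terms)

print(last)
```

Note that a string with no space at all does not match the pattern, so it is returned unchanged.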
EDIT:
As @akrun mentioned in the comments, sub can be used instead of gsub in this case, since the pattern matches at most once per string.
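If you would rather stay with stringi, the function from the question can probably be vectorized with a small change: the [1,1] index keeps only the first row of the location matrix, so taking the whole first column should handle every string at once. The sketch below is an assumption about what was intended, not a tested replacement on the real data; LastWordVec and LastWordRegex are names made up here. Whether either is faster than sub on 10 to 15 million rows would need to be benchmarked.

```r
library(stringi)

# Hypothetical sample vector, copied from the rows shown in the question
terms <- c("this is for the", "thank you for", "the following",
           "the fact that", "the first")

# Vectorized variant of the original idea: locate the last space in every
# string (column 1 of the result holds the start positions), then take
# everything after it. Strings without a space yield NA.
LastWordVec <- function(x) {
  pos <- stri_locate_last_fixed(x, " ")[, 1]
  stri_sub(x, pos + 1)
}

# Alternative: extract the last run of non-whitespace characters directly
LastWordRegex <- function(x) {
  stri_extract_last_regex(x, "\\S+")
}

print(LastWordVec(terms))
print(LastWordRegex(terms))
```

With either function, the column can then be added without mapply, for example df$Last <- LastWordVec(df$Term).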