I have a tibble with a list of words for each row. I want to create a new variable from a function that searches for a keyword and, if it finds the keyword, creates a string composed of the keyword plus-and-minus 3 words.
The code below is close, but, rather than grabbing all three words before and after my keyword, it grabs the single word 3 ahead/behind.
library(dplyr)

df <- tibble(words = c("it", "was", "the", "best", "of", "times",
                       "it", "was", "the", "worst", "of", "times"))

# lag(words, 3) and lead(words, 3) each return only the single word 3 rows away
df <- df %>%
  mutate(chunks = ifelse(words == "times",
                         paste(lag(words, 3), words, lead(words, 3), sep = " "),
                         NA))
The most intuitive solution would be if lag and lead accepted a vector of offsets, something like lead(words, 1:3), but that doesn't work.
Obviously I could pretty quickly do this by hand (paste(lead(words,3), lead(words,2), lead(words,1),...lag(words,3))), but I'll eventually want to grab the keyword plus-and-minus 50 words, which is too much to hand-code.
Would be ideal if a solution existed in the tidyverse, but any solution would be helpful. Any help would be appreciated.
One option would be sapply:
library(dplyr)

df %>%
  mutate(
    chunks = ifelse(
      words == "times",
      sapply(
        seq_along(words),
        # clamp the window to the vector bounds so rows near either end still work
        function(i) paste(words[max(1, i - 3):min(i + 3, length(words))], collapse = " ")
      ),
      NA
    )
  )
Output:
# A tibble: 12 x 2
   words chunks
   <chr> <chr>
 1 it    NA
 2 was   NA
 3 the   NA
 4 best  NA
 5 of    NA
 6 times the best of times it was the
 7 it    NA
 8 was   NA
 9 the   NA
10 worst NA
11 of    NA
12 times the worst of times
Although not an explicit lead or lag function, it can often serve the purpose as well.
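Since the eventual goal is a window of 50 words on each side, the same indexing idea can be wrapped in a function that takes the window size as a parameter. The following is a minimal sketch, not part of the original answer: the helper name keyword_window and its arguments are hypothetical, and it assumes the words are held in a plain character vector.

```r
library(dplyr)

# Hypothetical helper: for each position, return the keyword together with up to
# `n` words on either side, or NA where the word is not the keyword.
keyword_window <- function(words, keyword, n) {
  len <- length(words)
  vapply(seq_along(words), function(i) {
    # isTRUE() also treats NA entries as non-matches
    if (!isTRUE(words[i] == keyword)) return(NA_character_)
    # clamp the window so it never runs past either end of the vector
    paste(words[max(1, i - n):min(len, i + n)], collapse = " ")
  }, character(1))
}

df <- tibble(words = c("it", "was", "the", "best", "of", "times",
                       "it", "was", "the", "worst", "of", "times"))

res <- df %>% mutate(chunks = keyword_window(words, "times", 3))
```

Changing the last argument from 3 to 50 widens the window without any further hand-coding.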
Here is another tidyverse solution using lag and lead:
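library(dplyr)
library(tidyr)  # for unite()

# Note: mutate_at() and funs_() come from older dplyr versions and may be
# deprecated or unavailable in recent releases.

# Build named expression strings such as "lag (.,  3 , default = '')"
laglead_f <- function(what, range)
  setNames(paste(what, "(., ", range, ", default = '')"), paste(what, range))

df %>%
  mutate_at(vars(words), funs_(c(laglead_f("lag", 3:0), laglead_f("lead", 1:3)))) %>%
  unite(chunks, -words, sep = " ") %>%
  mutate(chunks = ifelse(words == "times", trimws(chunks), NA))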
## A tibble: 12 x 2
#    words chunks
#    <chr> <chr>
#  1 it    NA
#  2 was   NA
#  3 the   NA
#  4 best  NA
#  5 of    NA
#  6 times the best of times it was the
#  7 it    NA
#  8 was   NA
#  9 the   NA
# 10 worst NA
# 11 of    NA
# 12 times the worst of times
The idea is to store values from the three lagged and leading vectors in new columns with mutate_at and a named function, unite those columns, and then filter entries based on your condition where words == "times".
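The same lag-and-lead idea can also be written without building expression strings, which may be useful on dplyr releases where mutate_at and funs_ are discouraged. This is only a sketch under assumptions: shift_by and the window half-width n are hypothetical names introduced here, not part of the answer above.

```r
library(dplyr)

n <- 3  # window half-width (assumed; set to 50 for the larger window)

# Hypothetical helper: negative k lags by |k|, positive k leads by k,
# and positions past either end are padded with "".
shift_by <- function(x, k) {
  if (k < 0) dplyr::lag(x, -k, default = "") else dplyr::lead(x, k, default = "")
}

df <- tibble(words = c("it", "was", "the", "best", "of", "times",
                       "it", "was", "the", "worst", "of", "times"))

res <- df %>%
  mutate(
    # paste the 2n + 1 shifted copies of `words` together row by row
    chunks = do.call(paste, lapply(-n:n, function(k) shift_by(words, k))),
    # keep the window only on keyword rows; trimws() drops the "" padding at the ends
    chunks = ifelse(words == "times", trimws(chunks), NA)
  )
```

Because the offsets come from -n:n, the window size is controlled by a single value rather than by a hand-written list of lag and lead calls.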