
lead or lag function to get several values, not just the nth

Tags: r, dplyr, lag, lead

I have a tibble with a list of words for each row. I want to create a new variable from a function that searches for a keyword and, if it finds the keyword, creates a string composed of the keyword plus-and-minus 3 words.

The code below is close, but, rather than grabbing all three words before and after my keyword, it grabs the single word 3 ahead/behind.

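library(dplyr)

df <- tibble(words = c("it", "was", "the", "best", "of", "times", 
                       "it", "was", "the", "worst", "of", "times"))
# lag(words, 3) and lead(words, 3) each return only the single word
# three positions away, not the three words in between.
df <- df %>% mutate(chunks = ifelse(words=="times", 
                                    paste(lag(words, 3), 
                                          words, 
                                          lead(words, 3), sep = " "),
                                    NA))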

The most intuitive solution would be if lead (or lag) accepted a vector of offsets, as in lead(words, 1:3), but that doesn't work.

Obviously I could do this by hand fairly quickly (paste(lag(words, 3), lag(words, 2), lag(words, 1), ..., lead(words, 3))), but eventually I'll want to grab the keyword plus-and-minus 50 words, which is too much to hand-code.

Would be ideal if a solution existed in the tidyverse, but any solution would be helpful. Any help would be appreciated.

asked Mar 05 '19 20:03 by wscampbell



2 Answers

One option would be sapply:

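library(dplyr)

df %>%
  mutate(
    chunks = ifelse(
      words == "times",
      # For each row index x, paste the words from x - 3 to x + 3,
      # clamping the window to the first and last rows.
      sapply(
        seq_along(words),
        function(x) paste(words[max(1, x - 3):min(x + 3, length(words))], collapse = " ")
        ),
      NA
      )
  )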

Output:

# A tibble: 12 x 2
   words chunks                      
   <chr> <chr>                       
 1 it    NA                          
 2 was   NA                          
 3 the   NA                          
 4 best  NA                          
 5 of    NA                          
 6 times the best of times it was the
 7 it    NA                          
 8 was   NA                          
 9 the   NA                          
10 worst NA                          
11 of    NA                          
12 times the worst of times   

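Although this does not call lead or lag explicitly, indexing a window of rows around each position serves the same purpose, and the window width is a single number that is easy to change (for example, from 3 to 50).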
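To make the window width a parameter, the same indexing idea can be wrapped in a small function. This is only a sketch in base R; the name window_around and its arguments are hypothetical and not part of any package.

```r
# Hypothetical helper: for each position i in x, paste together the elements
# from i - n to i + n, clamped to the ends of the vector.
window_around <- function(x, n) {
  vapply(seq_along(x), function(i) {
    lo <- max(1L, i - n)          # do not run past the start
    hi <- min(length(x), i + n)   # do not run past the end
    paste(x[lo:hi], collapse = " ")
  }, character(1))
}

words <- c("it", "was", "the", "best", "of", "times",
           "it", "was", "the", "worst", "of", "times")

# Keep the window only where the keyword occurs; NA elsewhere.
chunks <- ifelse(words == "times", window_around(words, 3), NA_character_)
chunks[6]   # "the best of times it was the"
chunks[12]  # "the worst of times"
```

Switching to plus-and-minus 50 words should then only require changing the second argument, e.g. window_around(words, 50).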

answered Oct 08 '22 08:10 by arg0naut91


Here is another tidyverse solution using lag and lead:

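library(dplyr)
library(tidyr)  # for unite()

# Build named strings such as "lag (., 3 , default = '')", one per offset.
# Note: mutate_at() and funs_() are from the dplyr API of the time (early 2019);
# newer releases supersede or deprecate them, so this may need adapting.
laglead_f <- function(what, range)
    setNames(paste(what, "(., ", range, ", default = '')"), paste(what, range))

df %>%
    mutate_at(vars(words), funs_(c(laglead_f("lag", 3:0), laglead_f("lead", 1:3)))) %>%
    unite(chunks, -words, sep = " ") %>%
    mutate(chunks = ifelse(words == "times", trimws(chunks), NA))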
## A tibble: 12 x 2
#   words chunks
#   <chr> <chr>
# 1 it    NA
# 2 was   NA
# 3 the   NA
# 4 best  NA
# 5 of    NA
# 6 times the best of times it was the
# 7 it    NA
# 8 was   NA
# 9 the   NA
#10 worst NA
#11 of    NA
#12 times the worst of times

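The idea is to store the lagged and leading vectors (offsets 3 to 0 behind and 1 to 3 ahead) in new columns with mutate_at and a named set of function strings, unite those columns into a single chunks column, and then set chunks to NA wherever words != "times". Using default = '' pads the windows at the start and end of the data with empty strings, which trimws then strips.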
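The same lag/lead idea can also be written without intermediate columns, which makes the offset a plain argument. The following is a minimal sketch, assuming a dplyr version that provides lag(), lead() and if_else() with a default argument; the helper name context_window is hypothetical.

```r
library(dplyr)

# Hypothetical helper: build the n lagged copies, the vector itself, and the
# n leading copies of x, then paste them element-wise. Positions that fall
# outside the vector are padded with "" and trimmed afterwards.
context_window <- function(x, n) {
  before <- lapply(rev(seq_len(n)), function(k) dplyr::lag(x, k, default = ""))
  after  <- lapply(seq_len(n), function(k) dplyr::lead(x, k, default = ""))
  trimws(do.call(paste, c(before, list(x), after)))
}

df <- tibble(words = c("it", "was", "the", "best", "of", "times",
                       "it", "was", "the", "worst", "of", "times"))

result <- df %>%
  mutate(chunks = if_else(words == "times",
                          context_window(words, 3),
                          NA_character_))
result
```

With this shape, moving from 3 to 50 words of context is a matter of calling context_window(words, 50) rather than generating more columns.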

answered Oct 08 '22 09:10 by Maurits Evers