Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Conditionally pasting values from one column to another in R

Tags:

r

I'm struggling with some data wrangling stuff which I feel should be easy to solve (with lapply or something) but I just can't get it to work (and I'm evidently rusty as hell with this stuff). I'm working with a web-dataset structured as follows:

 df <- data.frame("ID" = c(1, 1, 1, 2, 3, 3),  
                 "URL_visit" = c(1, 2, 3, 1, 1, 2), # e.g. customer ID #1 has visited 3 pages
                 "URL_name" = c("home", "login", "product_page", "home", "home", "product_page"),
                 "duration" = c(14, 40, 233, 8, 76, 561),
                 "home" = c(1, 0, 0, 1, 1, 0),
                 "login" = c(0, 1, 0, 0, 0, 0), 
                 "product_page" = c(0, 0, 1, 0, 0, 1)
                   )

So basically a customer ID field, a number for each event per customer, the URL they visited in that event, followed by a column for each URL with an indicator (1/0) whether the customer visited that particular URL in that event.

My goal is to have a piece of code that replaces the 1/0 indicators with the duration if a match (1) is found, and retains 0 if not. In other words:

  • (1) Row-wise checks the values of the URL-columns (cols 5:7 aka home, login, product_page in my example. I also have a vector with the column names) to be a 1 or 0
  • (2) If it finds a 1, inserts the value from the duration column, otherwise 0.

Either replacing all the current (0/1) values in the URL-columns with duration where applicable, or making a new set of columns (e.g., "home_duration") works for me.

A manual example solution is:

df %<>% dplyr::mutate(home_duration = if_else(home == 1, duration, 0))

But of course my aim is to automate this, and conduct it for the whole set of URL-columns (passing a vector with URL column names).

Help is much appreciated! Thanks! :)

like image 592
TheNaidge Avatar asked Aug 06 '20 12:08

TheNaidge


2 Answers

You can try pivoting it to long, doing your transformation, and then pivoting it back wide again.

library(dplyr)
library(tidyr)

url_col_names <- c("home", "login", "product_page")

df %>% 
  pivot_longer(url_col_names, names_to = "url", values_to = "url_duration") %>% 
  mutate(url_duration = url_duration * duration) %>% 
  pivot_wider(names_from = "url", values_from = "url_duration")
# A tibble: 6 x 7
     ID URL_visit URL_name     duration  home login product_page
  <dbl>     <dbl> <fct>           <dbl> <dbl> <dbl>        <dbl>
1     1         1 home               14    14     0            0
2     1         2 login              40     0    40            0
3     1         3 product_page      233     0     0          233
4     2         1 home                8     8     0            0
5     3         1 home               76    76     0            0
6     3         2 product_page      561     0     0          561

Another way, probably more simple, is to do this.

df %>% 
  mutate(across(any_of(url_col_names), ~ . * duration))
  ID URL_visit     URL_name duration home login product_page
1  1         1         home       14   14     0            0
2  1         2        login       40    0    40            0
3  1         3 product_page      233    0     0          233
4  2         1         home        8    8     0            0
5  3         1         home       76   76     0            0
6  3         2 product_page      561    0     0          561

Edit

On another note, I imagine you created those indicator variables? If you are just hoping to replace them, then you actually might not need to create them to begin with. You can just pivot_wider() from the start.

This would assume that your ID and URL_visit columns form a unique row combination.

df2 <- df[, 1:4]

df2 %>% 
  pivot_wider(names_from = "URL_name", values_from = "duration", values_fill = 0)
like image 127
Adam Avatar answered Nov 03 '22 00:11

Adam


A simple multiplication should do the trick (this is equivalent to @Adam 's tidyverse solution above but in base R)

url_col_names <- c('home','login','product_page')    
df$duration * df[,url_col_names] -> df[,url_col_names]

To rename the columns, you can do:

names(df)[names(df) %in% url_col_names] <- paste0(url_col_names, '_', 'duration')
          
like image 39
Otto Kässi Avatar answered Nov 02 '22 23:11

Otto Kässi