Mutate with a list column function in dplyr

Tags:

text

r

dplyr

I am trying to calculate the Jaccard similarity between a source vector and comparison vectors in a tibble.

First, I create a tibble with a names_ column (a vector of strings) and names_vec, a list-column in which each row holds a character vector (each element of the vector is a letter).

Then, using dplyr's mutate, I create a new tibble with a column jaccard_sim that is supposed to hold the Jaccard similarity between each names_vec entry and source_vec.

library(dplyr)  # provides tibble(), mutate() and %>%

source_vec <- c('a', 'b', 'c')

df_comp <- tibble(names_ = c("b d f", "u k g", "m o c"),
              names_vec = strsplit(names_, ' '))

df_comp_jaccard <- df_comp %>%
   dplyr::mutate(jaccard_sim = length(intersect(names_vec, source_vec))/length(union(names_vec, source_vec)))

All the values in jaccard_sim are zero. However, if we run something like this, we get the correct Jaccard similarity of 0.2 for the first entry:

a <- length(intersect(source_vec, df_comp[[1,2]]))
b <- length(union(source_vec, df_comp[[1,2]]))
a/b
matsuo_basho asked Oct 23 '17 09:10



2 Answers

You could simply add rowwise

df_comp_jaccard <- df_comp %>%
  rowwise() %>%
  dplyr::mutate(jaccard_sim = length(intersect(names_vec, source_vec))/
                              length(union(names_vec, source_vec)))

# A tibble: 3 x 3
  names_ names_vec jaccard_sim
   <chr>    <list>       <dbl>
1  b d f <chr [3]>         0.2
2  u k g <chr [3]>         0.0
3  m o c <chr [3]>         0.2

Using rowwise you get the intuitive behavior some would expect when using mutate : "do this operation for every row".

Without rowwise, mutate applies your expression to whole columns at once. This takes advantage of vectorized functions and is much faster, which is why it is the default, but it may yield unexpected results if you're not careful.

The impression that mutate (or other dplyr functions) works row by row is an illusion created by vectorized functions; in fact, you are always working with full columns.

I'll illustrate with a couple of examples:

Sometimes the result is the same, with a vectorized function such as paste:

tibble(a=1:5,b=5:1) %>% mutate(X = paste(a,b,sep="_"))
tibble(a=1:5,b=5:1) %>% rowwise %>% mutate(X = paste(a,b,sep="_"))
# # A tibble: 5 x 3
#       a     b     X
#   <int> <int> <chr>
# 1     1     5   1_5
# 2     2     4   2_4
# 3     3     3   3_3
# 4     4     2   4_2
# 5     5     1   5_1

And sometimes it's different, with a function that is not vectorized, such as max:

tibble(a=1:5,b=5:1) %>% mutate(max(a,b))
# # A tibble: 5 x 3
#       a     b `max(a, b)`
#   <int> <int>       <int>
# 1     1     5           5
# 2     2     4           5
# 3     3     3           5
# 4     4     2           5
# 5     5     1           5

tibble(a=1:5,b=5:1) %>% rowwise %>% mutate(max(a,b))
# # A tibble: 5 x 3
#       a     b `max(a, b)`
#   <int> <int>       <int>
# 1     1     5           5
# 2     2     4           4
# 3     3     3           3
# 4     4     2           4
# 5     5     1           5

Note that in real code you shouldn't use rowwise for this case; use pmax instead, which is the vectorized (element-wise) counterpart of max:

tibble(a=1:5,b=5:1) %>% mutate(pmax(a,b))
# # A tibble: 5 x 3
#       a     b `pmax(a, b)`
#   <int> <int>        <int>
# 1     1     5            5
# 2     2     4            4
# 3     3     3            3
# 4     4     2            4
# 5     5     1            5

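intersect is such a function: it is not applied row by row. Without rowwise, you feed it the whole names_vec list column (a list of three character vectors) and source_vec (a character vector of letters). No element of the list equals a single letter, so the two objects have no intersection, and the resulting single 0 is recycled to every row.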
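
To see this outside of mutate, here is a minimal base R sketch (no packages needed) that rebuilds the question's list with strsplit and compares the whole-column call with the per-element call. The names whole_column and per_row are only illustrative.

```r
source_vec <- c("a", "b", "c")
# Same list as the names_vec column in the question
names_vec <- strsplit(c("b d f", "u k g", "m o c"), " ")

# What mutate() evaluates without rowwise(): the whole list at once.
# The list elements are character vectors, not single letters,
# so none of them matches "a", "b" or "c".
whole_column <- length(intersect(names_vec, source_vec)) /
  length(union(names_vec, source_vec))
whole_column

# What rowwise() evaluates: one character vector at a time.
per_row <- sapply(names_vec, function(x) {
  length(intersect(x, source_vec)) / length(union(x, source_vec))
})
per_row
```

The first call returns a single value, which mutate then recycles down the column; the second returns one value per row.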

Moody_Mudskipper answered Sep 20 '22 08:09


We can use map to loop through the list

library(tidyverse)
df_comp %>% 
     mutate(jaccard_sim = map_dbl(names_vec, ~length(intersect(.x, 
                 source_vec))/length(union(.x, source_vec))))
# A tibble: 3 x 3
#   names_ names_vec jaccard_sim
#    <chr>    <list>       <dbl>
#1  b d f <chr [3]>         0.2
#2  u k g <chr [3]>         0.0
#3  m o c <chr [3]>         0.2
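
If the expression gets hard to read, one option is to pull the formula into a small helper and apply it to each list element. The sketch below uses base R's vapply so that it runs without extra packages; jaccard is a hypothetical helper name, not a function from dplyr or purrr.

```r
# Hypothetical helper: Jaccard similarity of two vectors treated as sets
jaccard <- function(x, y) {
  length(intersect(x, y)) / length(union(x, y))
}

source_vec <- c("a", "b", "c")
names_vec <- strsplit(c("b d f", "u k g", "m o c"), " ")

# Apply the helper to each element of the list; numeric(1) asserts that
# every call returns exactly one number
jaccard_sim <- vapply(names_vec, jaccard, numeric(1), y = source_vec)
jaccard_sim
```

Inside the pipeline, the same helper could be called as mutate(jaccard_sim = map_dbl(names_vec, jaccard, y = source_vec)).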

map_dbl is also faster than rowwise here. Below are the system.time results for a larger dataset (the three rows repeated 1e5 times):

df_comp1 <- df_comp[rep(1:nrow(df_comp), 1e5),]
system.time({
  df_comp1 %>%
    rowwise() %>%
    dplyr::mutate(jaccard_sim = length(intersect(names_vec, source_vec)) /
                                length(union(names_vec, source_vec)))
})
 #user  system elapsed 
 # 25.59    0.05   25.96 

system.time({
  df_comp1 %>%
    mutate(jaccard_sim = map_dbl(names_vec, ~length(intersect(.x, source_vec)) /
                                             length(union(.x, source_vec))))
})
#user  system elapsed 
#  13.22    0.00   13.22 
akrun answered Sep 21 '22 08:09