Mutate with a list column function in dplyr

Tags: text, r, dplyr

I am trying to calculate the Jaccard similarity between a source vector and comparison vectors in a tibble.

First, I create a tibble with a names_ column (a character vector) and a list-column names_vec, in which each row holds a character vector of single letters.

Then I create a new tibble with a column jaccard_sim that is supposed to hold the Jaccard similarity.

source_vec <- c('a', 'b', 'c')

df_comp <- tibble(names_ = c("b d f", "u k g", "m o c"),
              names_vec = strsplit(names_, ' '))

df_comp_jaccard <- df_comp %>%
   dplyr::mutate(jaccard_sim = length(intersect(names_vec, source_vec))/length(union(names_vec, source_vec)))

All the values in jaccard_sim are zero. However, if we run something like this, we get the correct Jaccard similarity of 0.2 for the first entry:

a <- length(intersect(source_vec, df_comp[[1,2]]))
b <- length(union(source_vec, df_comp[[1,2]]))
a/b


asked Oct 23 '17 09:10

matsuo_basho

2 Answers

You could simply add rowwise

df_comp_jaccard <- df_comp %>%
  rowwise() %>%
  dplyr::mutate(jaccard_sim = length(intersect(names_vec, source_vec))/
                              length(union(names_vec, source_vec)))

# A tibble: 3 x 3
  names_ names_vec jaccard_sim
   <chr>    <list>       <dbl>
1  b d f <chr [3]>         0.2
2  u k g <chr [3]>         0.0
3  m o c <chr [3]>         0.2

With rowwise you get the behavior many people intuitively expect from mutate: "do this operation for every row".

Without rowwise, mutate evaluates each expression once on whole columns. That lets you take advantage of vectorized functions, which is much faster and is why it is the default, but it may yield unexpected results if you're not careful.

The impression that mutate (or other dplyr functions) works row by row is an illusion that arises because most of the functions you call are vectorized. In fact, you are always working with full columns.

I'll illustrate with a couple of examples:

Sometimes the result is the same, with a vectorized function such as paste:

tibble(a=1:5,b=5:1) %>% mutate(X = paste(a,b,sep="_"))
tibble(a=1:5,b=5:1) %>% rowwise %>% mutate(X = paste(a,b,sep="_"))
# # A tibble: 5 x 3
#       a     b     X
#   <int> <int> <chr>
# 1     1     5   1_5
# 2     2     4   2_4
# 3     3     3   3_3
# 4     4     2   4_2
# 5     5     1   5_1

And sometimes it's different, with a function that is not vectorized, such as max, which collapses all of its arguments into a single value:

tibble(a=1:5,b=5:1) %>% mutate(max(a,b))
# # A tibble: 5 x 3
#       a     b `max(a, b)`
#   <int> <int>       <int>
# 1     1     5           5
# 2     2     4           5
# 3     3     3           5
# 4     4     2           5
# 5     5     1           5

tibble(a=1:5,b=5:1) %>% rowwise %>% mutate(max(a,b))
# # A tibble: 5 x 3
#       a     b `max(a, b)`
#   <int> <int>       <int>
# 1     1     5           5
# 2     2     4           4
# 3     3     3           3
# 4     4     2           4
# 5     5     1           5

Note that in real code you shouldn't use rowwise for this case. Use pmax, which is the vectorized (element-wise) counterpart of max:

tibble(a=1:5,b=5:1) %>% mutate(pmax(a,b))
# # A tibble: 5 x 3
#       a     b `pmax(a, b)`
#   <int> <int>        <int>
# 1     1     5            5
# 2     2     4            4
# 3     3     3            3
# 4     4     2            4
# 5     5     1            5

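intersect is such a non-vectorized function. You fed it a whole list-column (a list of character vectors) and a plain character vector, and these two objects have no elements in common, so the intersection is empty and the ratio is 0 for every row.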
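To make the point about intersect concrete, here is a minimal base R sketch (no dplyr needed) that reproduces both computations outside of mutate. The names whole_column and per_row are illustrative only and do not appear in the question.

```r
source_vec <- c("a", "b", "c")

# Same content as the names_vec list-column: a list of 3 character vectors
names_vec <- strsplit(c("b d f", "u k g", "m o c"), " ")

# What mutate() evaluates without rowwise(): the whole list at once.
# No list element equals "a", "b" or "c", so the intersection is empty.
whole_column <- length(intersect(names_vec, source_vec)) /
  length(union(names_vec, source_vec))
whole_column

# What was intended: one similarity per list element.
per_row <- vapply(
  names_vec,
  function(x) length(intersect(x, source_vec)) / length(union(x, source_vec)),
  numeric(1)
)
per_row
```

The first value is a single number that mutate then recycles to every row, which is why the whole column is zero. The second is a vector with one value per row.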


answered Sep 20 '22 08:09

Moody_Mudskipper

We can use map_dbl (from purrr) to loop through the list-column:

library(tidyverse)
df_comp %>% 
     mutate(jaccard_sim = map_dbl(names_vec, ~length(intersect(.x, 
                 source_vec))/length(union(.x, source_vec))))
# A tibble: 3 x 3
#   names_ names_vec jaccard_sim
#    <chr>    <list>       <dbl>
#1  b d f <chr [3]>         0.2
#2  u k g <chr [3]>         0.0
#3  m o c <chr [3]>         0.2
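If the formula gets hard to read inline, one option is to pull it into a small named function first. The sketch below assumes dplyr and purrr are installed. The helper name jaccard is hypothetical and not part of either package.

```r
library(dplyr)
library(purrr)

# Hypothetical helper (not from dplyr or purrr): Jaccard similarity of two vectors.
# Note: returns NaN when both inputs are empty (0 / 0).
jaccard <- function(x, y) {
  length(intersect(x, y)) / length(union(x, y))
}

source_vec <- c("a", "b", "c")
df_comp <- tibble(names_ = c("b d f", "u k g", "m o c"),
                  names_vec = strsplit(names_, " "))

# map_dbl() calls jaccard() once per list element and returns a double vector
df_comp_jaccard <- df_comp %>%
  mutate(jaccard_sim = map_dbl(names_vec, ~ jaccard(.x, source_vec)))

df_comp_jaccard$jaccard_sim
```

map_dbl also fails loudly if any call returns something other than a single number, which is a useful check when working with list-columns.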

The map functions are also faster than rowwise here. Below are the system.time results for a larger dataset (the three rows repeated 1e5 times):

df_comp1 <- df_comp[rep(1:nrow(df_comp), 1e5),]
system.time({
  df_comp1 %>%
    rowwise() %>%
    dplyr::mutate(jaccard_sim = length(intersect(names_vec, source_vec))/
                                length(union(names_vec, source_vec)))
})
#   user  system elapsed
#  25.59    0.05   25.96

system.time({
  df_comp1 %>%
    mutate(jaccard_sim = map_dbl(names_vec, ~length(intersect(.x, source_vec))/
                                             length(union(.x, source_vec))))
})
#   user  system elapsed
#  13.22    0.00   13.22

answered Sep 21 '22 08:09

akrun
