most efficient R cosine calculation

I have two vectors of values and one vector of weights, and I need to calculate the cosine similarity. For complicated reasons, I can only calculate the cosine for one pair at a time. But I have to do it many millions of times.

cosine_calc <- function(a,b,wts) {
  #scale both vectors by the weights, then compute the cosine of the scaled vectors
  a = a*wts
  b = b*wts
  (a %*% b)/(sqrt(a%*%a)*sqrt(b%*%b))
}

This works, but I want to eke better performance out of it.

Example data:

a = c(-1.2092420, -0.7053822, 1.4364633, 1.3612304, -0.3029147, 1.0319704, 0.6707610, -2.2128987, -0.9839970, -0.4302205)
b = c(-0.69042619, 0.05811749, -0.17836802, 0.15699691, 0.78575477, 0.27925779, -0.08552864, -1.31031219, -1.92756861, -1.36350112)
w = c(0.26333839, 0.12803180, 0.62396023, 0.37393705, 0.13539926, 0.09199102, 0.37347546, 1.36790007, 0.64978409, 0.46256891)
> cosine_calc(a,b,w)
          [,1]
[1,] 0.8390671
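As a quick sanity check (my addition, not from the original post): with all weights equal to 1, the weighted cosine reduces to the ordinary unweighted cosine similarity, which is easy to verify on a vector pair whose angle is known:

```r
cosine_calc <- function(a, b, wts) {
  # scale both vectors by the weights, then compute the cosine of the scaled vectors
  a <- a * wts
  b <- b * wts
  (a %*% b) / (sqrt(a %*% a) * sqrt(b %*% b))
}

# (1,0) and (1,1) are 45 degrees apart, so the cosine should be 1/sqrt(2)
cosine_calc(c(1, 0), c(1, 1), c(1, 1))  # 0.7071068
```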

This question points out that there are other predefined cosine functions available in R, but says nothing about their relative efficiency.

asked Nov 16 '11 by ansate



1 Answer

All the functions you're using are .Primitive (so they already dispatch directly to compiled code), which means it will be hard to find consistent speed gains short of rebuilding R with an optimized BLAS. With that said, here is one option that might be faster for larger vectors:

cosine_calc2 <- function(a,b,wts) {
  a = a*wts
  b = b*wts
  # crossprod(a,b) computes t(a) %*% b without forming an explicit transpose
  crossprod(a,b)/sqrt(crossprod(a)*crossprod(b))
}

# alias the original function so the benchmark labels below line up
cosine_calc1 <- cosine_calc

all.equal(cosine_calc1(a,b,w), cosine_calc2(a,b,w))
# [1] TRUE

# Check some timings
library(rbenchmark)
# cosine_calc2 is slower on my machine in this case
benchmark(
  cosine_calc1(a,b,w),
  cosine_calc2(a,b,w), replications=1e5, columns=1:4 )
#                    test replications user.self sys.self
# 1 cosine_calc1(a, b, w)       100000      1.06     0.02
# 2 cosine_calc2(a, b, w)       100000      1.21     0.00

# but cosine_calc2 is faster for larger vectors
set.seed(21)
a <- rnorm(1000)
b <- rnorm(1000)
w <- runif(1000)
benchmark(
  cosine_calc1(a,b,w),
  cosine_calc2(a,b,w), replications=1e5, columns=1:4 )
#                    test replications user.self sys.self
# 1 cosine_calc1(a, b, w)       100000      3.83        0
# 2 cosine_calc2(a, b, w)       100000      2.12        0

UPDATE:

Profiling reveals that quite a bit of time is spent multiplying each vector by the weight vector (the `*` row in the output below is elementwise multiplication).

> Rprof(); for(i in 1:100000) cosine_calc2(a,b,w); Rprof(NULL); summaryRprof()
$by.self
             self.time self.pct total.time total.pct
*                 0.80    45.98       0.80     45.98
crossprod         0.56    32.18       0.56     32.18
cosine_calc2      0.32    18.39       1.74    100.00
sqrt              0.06     3.45       0.06      3.45

$by.total
             total.time total.pct self.time self.pct
cosine_calc2       1.74    100.00      0.32    18.39
*                  0.80     45.98      0.80    45.98
crossprod          0.56     32.18      0.56    32.18
sqrt               0.06      3.45      0.06     3.45

$sample.interval
[1] 0.02

$sampling.time
[1] 1.74

If you can do the weighting once, before you have to call the function millions of times, it could save you quite a bit of time. cosine_calc3 is marginally faster than your original function even with small vectors. Byte-compiling the function should give you another marginal speedup.

cosine_calc3 <- function(a,b) {
  crossprod(a,b)/sqrt(crossprod(a)*crossprod(b))
}
A = a*w
B = b*w
# Run again on the 1000-element vectors
benchmark(
  cosine_calc1(a,b,w),
  cosine_calc2(a,b,w),
  cosine_calc3(A,B), replications=1e5, columns=1:4 )
#                    test replications user.self sys.self
# 1 cosine_calc1(a, b, w)       100000      3.85     0.00
# 2 cosine_calc2(a, b, w)       100000      2.13     0.02
# 3    cosine_calc3(A, B)       100000      1.31     0.00
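The byte-compiling suggestion above can be sketched with the `compiler` package that ships with R (the name `cosine_calc3_bc` is mine, purely illustrative). Note that recent R versions JIT-compile functions by default, so the gain may be smaller than it was when this answer was written:

```r
library(compiler)

cosine_calc3 <- function(a, b) {
  # assumes a and b have already been scaled by the weights
  crossprod(a, b) / sqrt(crossprod(a) * crossprod(b))
}

# byte-compile for a (possibly marginal) speedup on repeated calls
cosine_calc3_bc <- cmpfun(cosine_calc3)

set.seed(21)
a <- rnorm(1000); b <- rnorm(1000); w <- runif(1000)
A <- a * w; B <- b * w

# the compiled version returns the same result
all.equal(cosine_calc3(A, B), cosine_calc3_bc(A, B))
# [1] TRUE
```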
answered Oct 12 '22 by Joshua Ulrich