This is a follow on question to an earlier post I made here - I think I made significant progress and now the question has changed.
I have a "matching" matrix which looks like the following:
    [,1] [,2]
[1,]    1    2
[2,]    5    6
[3,]    7    8
[4,]    9   10
[5,]   11   13
[6,]   14   15
[7,]   16   17
[8,]   18   19
I also have a dtm - document term matrix:
1108058_10-K_2005  . . . . . . . 1 . . . . 1 . . . . 1 . .
1108058_10-K_2006  . . . . . . . . . . . . . . . . . . . .
72243_10-K_2005    . . . . . . . . . . . . . . . . . . . .
1352341_10-K_2006  1 . 1 . . 1 . . . . . . . . 1 . . . . .
64040_10-K_2005    . . . . . . . . . . . . . . . . . . . .
64040_10-K_2006    . . . . . . . . . . . . . . . . . . . .
1111247_10-K_2005  . . . . . . . . . . . . . . . . . . . .
1111247_10-K_2006  . . . . 1 . . . . . . . . . . . . . . .
1129425_10-K_2005  . . . . . . . . . . 1 1 . . . . . . . .
1129425_10-K_2006  . . . . . . . . . . . . . . . 1 1 . . .
943894_10-K_2005   . . . . . . . . . . . . . . . . . . . .
943894_10-K/A_2005 . . . . . . . . . . . . . . . . . . . .
943894_10-K_2006   . . . 1 . . . . . 1 . . . . . . . . . .
1176316_10-K_2005  . . . . . . . . . . . . . . . . . . . .
1176316_10-K_2006  . . . . . . 1 . . . . . . . . . . . . .
805305_10-K_2005   . . . . . . . . . . . . . . . . . . . .
805305_10-K_2006   . 1 . . . . . . . . . . . 1 . . . . 1 1
63276_10-K_2005    . . . . . . . . 1 . . . . . . . . . . .
63276_10-K_2006    . . . . . . . . . . . . . . . . . . . .
I can run the following dist function:
dist2(dtm[matching[, 1], ], dtm[matching[, 2], ], method = "cosine", norm = "none")
Which outputs:
WARN [2019-09-11 20:51:40] Sparsity will be lost - worth to calculate similarity instead of distance.
8 x 8 Matrix of class "dgeMatrix"
                  1108058_10-K_2006 64040_10-K_2006 1111247_10-K_2006 1129425_10-K_2006
1108058_10-K_2005                 1               1                 1                 1
64040_10-K_2005                   1               1                 1                 1
1111247_10-K_2005                 1               1                 1                 1
1129425_10-K_2005                 1               1                 1                 1
943894_10-K_2005                  1               1                 1                 1
1176316_10-K_2005                 1               1                 1                 1
805305_10-K_2005                  1               1                 1                 1
63276_10-K_2005                   1               1                 1                 1
                  943894_10-K_2006 1176316_10-K_2006 805305_10-K_2006 63276_10-K_2006
1108058_10-K_2005                1                 1                1               1
64040_10-K_2005                  1                 1                1               1
1111247_10-K_2005                1                 1                1               1
1129425_10-K_2005                1                 1                1               1
943894_10-K_2005                 1                 1                1               1
1176316_10-K_2005                1                 1                1               1
805305_10-K_2005                 1                 1                1               1
63276_10-K_2005                  1                 1                1               1
Which almost does what I want but not quite. It is still calculating "too" many calculations. I want to calculate the dist2 function according to the "rowise" observations in matching. That is calculate dist2 for observation 1 and 2. Then calculate the next dist2 for observation 5 and 6 and then 7 and 8 and so on.
Data:
library(text2vec)
matching <- structure(c(1, 5, 7, 9, 11, 14, 16, 18, 2, 6, 8, 10, 13, 15, 
17, 19), .Dim = c(8L, 2L))
dtm <- new("dgCMatrix", i = c(3L, 16L, 3L, 12L, 7L, 3L, 14L, 0L, 17L, 
12L, 8L, 8L, 0L, 16L, 3L, 9L, 9L, 0L, 16L, 16L), p = 0:20, Dim = 19:20, 
    Dimnames = list(c("1108058_10-K_2005", "1108058_10-K_2006", 
    "72243_10-K_2005", "1352341_10-K_2006", "64040_10-K_2005", 
    "64040_10-K_2006", "1111247_10-K_2005", "1111247_10-K_2006", 
    "1129425_10-K_2005", "1129425_10-K_2006", "943894_10-K_2005", 
    "943894_10-K/A_2005", "943894_10-K_2006", "1176316_10-K_2005", 
    "1176316_10-K_2006", "805305_10-K_2005", "805305_10-K_2006", 
    "63276_10-K_2005", "63276_10-K_2006"), c("counterclaim", 
    "reacting", "dissipating", "delisted", "trades", "relocated", 
    "buyers", "allege", "wind", "antiquated", "initiating", "detract", 
    "instat", "putters", "confronted", "enrolling", "futility", 
    "repatriating", "oppose", "communicates")), x = c(1, 1, 1, 
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), factors = list())
EDIT (my attempts are incorrect): This allows me to apply the dist function on the first row:
  m1 <- as.matrix(dtm[matching[1, ], ])
  dist2(m1, method = "cosine", norm = "none")[1, 2]
Applying it on the second row:
  m1 <- as.matrix(dtm[matching[2, ], ])
  dist2(m1, method = "cosine", norm = "none")
Just need to iterate and create a function to apply it over all rows.
A hacked together some sort of solution (not complete):
for(i in 1:nrow(matching)){
  m <- as.matrix(dtm[matching[i, ], ])
  dist <- dist2(m, method = "cosine", norm = "none")[1, 2]
  print(dist)
}
If anybody can help make this into a function that would be great!
This doesn't give me the correct result
foo <- function(data){
  col1 = data[, 1]
  col2 = data[, 2]
  dist = dist2(dtm[col1, ], dtm[col2, ], method = "cosine", norm = "none")
  return(dist)
}
foo(matching)
or this (does not work):
apply(matching, 1, function(x, y) dist2(dtm[x, ], dtm[y, ], method = "cosine", norm = "norm"))
When I apply the "full" function over the matching data I get a matrix like this: dist2(dtm[matching[, 1], ], dtm[matching[, 2], ], method = rwmd, norm = "none") 
(Note: I use a custom method rwmd instead of cosine and I use all the data in the document term matrix - I have also take a new random sample of the data so this data does not match up with the previous data).
                  1019695_10-K_2006 718937_10-K_2006 708955_10-K_2006 923120_10-K_2006 1020569_10-K_2006 862022_10-K_2006
1019695_10-K_2005        0.06690147       0.26848699       0.52009095       0.29421497        0.27183372        0.4673677
718937_10-K_2005         0.21579128       0.03183972       0.44026262       0.26678393        0.24644321        0.4339234
708955_10-K_2005         0.51919906       0.44900795       0.02992449       0.40760294        0.39043990        0.4338723
923120_10-K_2005         0.35596766       0.32048006       0.43839797       0.07794912        0.25703208        0.4123749
1020569_10-K_2005        0.27958200       0.24791561       0.39780292       0.19322863        0.01679282        0.3915167
862022_10-K_2005         0.51707930       0.49270230       0.44924855       0.45008895        0.45454247        0.0887527
917857_10-K_2005         0.30562057       0.27731399       0.41435485       0.22840343        0.22982293        0.4053557
                  917857_10-K_2006
1019695_10-K_2005       0.30368532
718937_10-K_2005        0.25491939
708955_10-K_2005        0.42074617
923120_10-K_2005        0.30625747
1020569_10-K_2005       0.22772452
862022_10-K_2005        0.48192247
917857_10-K_2005        0.03438092
This gets me what I want - but gives too many calculations. That is I am only interested in the diagonal of this matrix where the values are 0.06690147, 0.06690147, 0.02992449 and so on. Which correspond to the points in the matching data here:
     [,1] [,2]
[1,]    1    2
[2,]    3    5
[3,]    7    8
[4,]    9   10
[5,]   12   13
[6,]   15   16
[7,]   18   19
These points correspond to the row locations in the dtm matix.
> dtm[,1:10]
19 x 10 sparse Matrix of class "dgCMatrix"
   [[ suppressing 10 column names ‘reacting’, ‘ments’, ‘proper’ ... ]]
1019695_10-K_2005  . . . . . . . . . .
1019695_10-K_2006  . . . . . . . . 1 1
718937_10-K_2005   . . . . . . . . . .
718937_10-K/A_2005 . . . . . . . . . .
718937_10-K_2006   . . . . . . . . . .
1034258_10-K_2006  . . . 1 . . . . . .
708955_10-K_2005   . . . . . . . . . .
708955_10-K_2006   . . . . . . . . . .
923120_10-K_2005   . . . . . . . . . .
923120_10-K_2006   . . . . . . . . . .
923120_10-K/A_2006 . . . . . . . . . .
1020569_10-K_2005  . . . . . . . . . .
1020569_10-K_2006  1 . . . . . 1 . . .
1009463_10-K_2005  . . . . . 1 . . . .
862022_10-K_2005   . . . . . . . . . .
862022_10-K_2006   . . 1 . . . . . . .
868271_10-K_2005   . 1 . . . . . 1 . .
917857_10-K_2005   . . . . . . . . . .
917857_10-K_2006   . . . . 1 . . . . .
That is I should obtain a result of 7 - which are the diagonal of the dist2 matrix.
Applying all your functions gives the following:
Method 1:
> apply(matching, 1, function(x) dist2(as.matrix(dtm[x,]), method = rwmd, norm = 'none'))
Error in method$dist2(x, y) : 
  inherits(x, "sparseMatrix") && inherits(y, "sparseMatrix") is not TRUE
Called from: method$dist2(x, y)
Method 2:
> apply(matching, 1, function(x) dist2((dtm[x,]), method = rwmd, norm = 'none'))
  |====================================================================================================| 100%
  |====================================================================================================| 100%
  |====================================================================================================| 100%
  |====================================================================================================| 100%
  |====================================================================================================| 100%
  |====================================================================================================| 100%
  |====================================================================================================| 100%
                           [,1]                       [,2]                       [,3]
[1,] -0.00000000000000001804112 -0.00000000000000001518568 -0.00000000000000003168025
[2,]  0.06690147056044426499000  0.03183972474513259431905  0.02992448660488894462972
[3,]  0.06690147056044426499000  0.03183972474513259431905  0.02992448660488894462972
[4,] -0.00000000000000002283564 -0.00000000000000001232901 -0.00000000000000003952019
                           [,4]                        [,5]                       [,6]
[1,] -0.00000000000000001162810 -0.000000000000000009077403 -0.00000000000000003039822
[2,]  0.07794911930538156452641  0.016792819916915013161995  0.08875270114006890420644
[3,]  0.07794911930538156452641  0.016792819916915013161995  0.08875270114006890420644
[4,] -0.00000000000000001939834 -0.000000000000000009394918 -0.00000000000000004965902
                           [,7]
[1,] -0.00000000000000001829033
[2,]  0.03438092421044294105803
[3,]  0.03438092421044294105803
[4,] -0.00000000000000001748001
(Which gives some of the correct results from the diagonal but also some additional results)
This will loop through each row of your matching matrix and execute the line that you said works:
apply(matching, 1, function(x) dist2(as.matrix(dtm[x,]), method = 'cosine', norm = 'none'))
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,]   -2    1    1   -1    1    1    1    0
[2,]    1    1    1    1    1    1    1    1
[3,]    1    1    1    1    1    1    1    1
[4,]    1    1    0   -1   -1    0   -3    1
Or, if you want to keep the naming conventions, you can skip the conversion of the as.matrix:
res<-apply(matching, 1, function(x) dist2((dtm[x,]), method = 'cosine', norm = 'none'))
res
[[1]]
2 x 2 Matrix of class "dgeMatrix"
                  1108058_10-K_2005 1108058_10-K_2006
1108058_10-K_2005                -2                 1
1108058_10-K_2006                 1                 1
[[2]]
2 x 2 Matrix of class "dgeMatrix"
                64040_10-K_2005 64040_10-K_2006
64040_10-K_2005               1               1
64040_10-K_2006               1               1
#6 more list items...
And if you don't like working with lists, you can convert your list to an array:
library(abind)
abind::abind(lapply(res, as.matrix), along = 3)
, , 1
                63276_10-K_2005 63276_10-K_2006
63276_10-K_2005              -2               1
63276_10-K_2006               1               1
, , 2
                63276_10-K_2005 63276_10-K_2006
63276_10-K_2005               1               1
63276_10-K_2006               1               1
#6 more matrix slices...
Separately, your attempt at an apply statement tried to pass two variables x and y. The apply() only passes 1 variable - the row vector. Instead, you have to subset:
apply(matching, 1, function(x) sum(x[1],x[2]))
[1]  3 11 15 19 24 29 33 37
                        If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With