This is a follow-up question to an earlier post I made here. I think I made significant progress, and now the question has changed.
I have a "matching" matrix which looks like the following:
[,1] [,2]
[1,] 1 2
[2,] 5 6
[3,] 7 8
[4,] 9 10
[5,] 11 13
[6,] 14 15
[7,] 16 17
[8,] 18 19
I also have a dtm (document-term matrix):
1108058_10-K_2005 . . . . . . . 1 . . . . 1 . . . . 1 . .
1108058_10-K_2006 . . . . . . . . . . . . . . . . . . . .
72243_10-K_2005 . . . . . . . . . . . . . . . . . . . .
1352341_10-K_2006 1 . 1 . . 1 . . . . . . . . 1 . . . . .
64040_10-K_2005 . . . . . . . . . . . . . . . . . . . .
64040_10-K_2006 . . . . . . . . . . . . . . . . . . . .
1111247_10-K_2005 . . . . . . . . . . . . . . . . . . . .
1111247_10-K_2006 . . . . 1 . . . . . . . . . . . . . . .
1129425_10-K_2005 . . . . . . . . . . 1 1 . . . . . . . .
1129425_10-K_2006 . . . . . . . . . . . . . . . 1 1 . . .
943894_10-K_2005 . . . . . . . . . . . . . . . . . . . .
943894_10-K/A_2005 . . . . . . . . . . . . . . . . . . . .
943894_10-K_2006 . . . 1 . . . . . 1 . . . . . . . . . .
1176316_10-K_2005 . . . . . . . . . . . . . . . . . . . .
1176316_10-K_2006 . . . . . . 1 . . . . . . . . . . . . .
805305_10-K_2005 . . . . . . . . . . . . . . . . . . . .
805305_10-K_2006 . 1 . . . . . . . . . . . 1 . . . . 1 1
63276_10-K_2005 . . . . . . . . 1 . . . . . . . . . . .
63276_10-K_2006 . . . . . . . . . . . . . . . . . . . .
I can run the following dist2 call:
dist2(dtm[matching[, 1], ], dtm[matching[, 2], ], method = "cosine", norm = "none")
Which outputs:
WARN [2019-09-11 20:51:40] Sparsity will be lost - worth to calculate similarity instead of distance.
8 x 8 Matrix of class "dgeMatrix"
1108058_10-K_2006 64040_10-K_2006 1111247_10-K_2006 1129425_10-K_2006
1108058_10-K_2005 1 1 1 1
64040_10-K_2005 1 1 1 1
1111247_10-K_2005 1 1 1 1
1129425_10-K_2005 1 1 1 1
943894_10-K_2005 1 1 1 1
1176316_10-K_2005 1 1 1 1
805305_10-K_2005 1 1 1 1
63276_10-K_2005 1 1 1 1
943894_10-K_2006 1176316_10-K_2006 805305_10-K_2006 63276_10-K_2006
1108058_10-K_2005 1 1 1 1
64040_10-K_2005 1 1 1 1
1111247_10-K_2005 1 1 1 1
1129425_10-K_2005 1 1 1 1
943894_10-K_2005 1 1 1 1
1176316_10-K_2005 1 1 1 1
805305_10-K_2005 1 1 1 1
63276_10-K_2005 1 1 1 1
This almost does what I want, but not quite: it still performs too many calculations. I want to apply dist2 only to the row-wise pairs in matching. That is, calculate dist2 for observations 1 and 2, then for observations 5 and 6, then for 7 and 8, and so on.
Data:
library(text2vec)
matching <- structure(c(1, 5, 7, 9, 11, 14, 16, 18, 2, 6, 8, 10, 13, 15,
17, 19), .Dim = c(8L, 2L))
dtm <- new("dgCMatrix", i = c(3L, 16L, 3L, 12L, 7L, 3L, 14L, 0L, 17L,
12L, 8L, 8L, 0L, 16L, 3L, 9L, 9L, 0L, 16L, 16L), p = 0:20, Dim = 19:20,
Dimnames = list(c("1108058_10-K_2005", "1108058_10-K_2006",
"72243_10-K_2005", "1352341_10-K_2006", "64040_10-K_2005",
"64040_10-K_2006", "1111247_10-K_2005", "1111247_10-K_2006",
"1129425_10-K_2005", "1129425_10-K_2006", "943894_10-K_2005",
"943894_10-K/A_2005", "943894_10-K_2006", "1176316_10-K_2005",
"1176316_10-K_2006", "805305_10-K_2005", "805305_10-K_2006",
"63276_10-K_2005", "63276_10-K_2006"), c("counterclaim",
"reacting", "dissipating", "delisted", "trades", "relocated",
"buyers", "allege", "wind", "antiquated", "initiating", "detract",
"instat", "putters", "confronted", "enrolling", "futility",
"repatriating", "oppose", "communicates")), x = c(1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), factors = list())
EDIT (my attempts are incorrect): This allows me to apply the dist2 function to the first row:
m1 <- as.matrix(dtm[matching[1, ], ])
dist2(m1, method = "cosine", norm = "none")[1, 2]
Applying it on the second row:
m1 <- as.matrix(dtm[matching[2, ], ])
dist2(m1, method = "cosine", norm = "none")
I just need to iterate and create a function to apply it over all rows.
I hacked together a partial solution (not complete):
for (i in 1:nrow(matching)) {
  m <- as.matrix(dtm[matching[i, ], ])
  dist <- dist2(m, method = "cosine", norm = "none")[1, 2]
  print(dist)
}
If anybody can help make this into a function that would be great!
This doesn't give me the correct result:
foo <- function(data) {
  col1 = data[, 1]
  col2 = data[, 2]
  dist = dist2(dtm[col1, ], dtm[col2, ], method = "cosine", norm = "none")
  return(dist)
}
foo(matching)
or this (does not work):
apply(matching, 1, function(x, y) dist2(dtm[x, ], dtm[y, ], method = "cosine", norm = "norm"))
When I apply the "full" function over the matching data, dist2(dtm[matching[, 1], ], dtm[matching[, 2], ], method = rwmd, norm = "none"), I get a matrix like the one below.
(Note: I use a custom method, rwmd, instead of cosine, and I use all the data in the document-term matrix. I have also taken a new random sample of the data, so this data does not match the previous data.)
1019695_10-K_2006 718937_10-K_2006 708955_10-K_2006 923120_10-K_2006 1020569_10-K_2006 862022_10-K_2006
1019695_10-K_2005 0.06690147 0.26848699 0.52009095 0.29421497 0.27183372 0.4673677
718937_10-K_2005 0.21579128 0.03183972 0.44026262 0.26678393 0.24644321 0.4339234
708955_10-K_2005 0.51919906 0.44900795 0.02992449 0.40760294 0.39043990 0.4338723
923120_10-K_2005 0.35596766 0.32048006 0.43839797 0.07794912 0.25703208 0.4123749
1020569_10-K_2005 0.27958200 0.24791561 0.39780292 0.19322863 0.01679282 0.3915167
862022_10-K_2005 0.51707930 0.49270230 0.44924855 0.45008895 0.45454247 0.0887527
917857_10-K_2005 0.30562057 0.27731399 0.41435485 0.22840343 0.22982293 0.4053557
917857_10-K_2006
1019695_10-K_2005 0.30368532
718937_10-K_2005 0.25491939
708955_10-K_2005 0.42074617
923120_10-K_2005 0.30625747
1020569_10-K_2005 0.22772452
862022_10-K_2005 0.48192247
917857_10-K_2005 0.03438092
This gets me what I want, but performs too many calculations. That is, I am only interested in the diagonal of this matrix, where the values are 0.06690147, 0.03183972, 0.02992449 and so on. These correspond to the pairs in the matching data here:
[,1] [,2]
[1,] 1 2
[2,] 3 5
[3,] 7 8
[4,] 9 10
[5,] 12 13
[6,] 15 16
[7,] 18 19
These points correspond to the row locations in the dtm matrix.
> dtm[,1:10]
19 x 10 sparse Matrix of class "dgCMatrix"
[[ suppressing 10 column names ‘reacting’, ‘ments’, ‘proper’ ... ]]
1019695_10-K_2005 . . . . . . . . . .
1019695_10-K_2006 . . . . . . . . 1 1
718937_10-K_2005 . . . . . . . . . .
718937_10-K/A_2005 . . . . . . . . . .
718937_10-K_2006 . . . . . . . . . .
1034258_10-K_2006 . . . 1 . . . . . .
708955_10-K_2005 . . . . . . . . . .
708955_10-K_2006 . . . . . . . . . .
923120_10-K_2005 . . . . . . . . . .
923120_10-K_2006 . . . . . . . . . .
923120_10-K/A_2006 . . . . . . . . . .
1020569_10-K_2005 . . . . . . . . . .
1020569_10-K_2006 1 . . . . . 1 . . .
1009463_10-K_2005 . . . . . 1 . . . .
862022_10-K_2005 . . . . . . . . . .
862022_10-K_2006 . . 1 . . . . . . .
868271_10-K_2005 . 1 . . . . . 1 . .
917857_10-K_2005 . . . . . . . . . .
917857_10-K_2006 . . . . 1 . . . . .
That is, I should obtain 7 values, which are the diagonal of the dist2 matrix.
Applying all your functions gives the following:
Method 1:
> apply(matching, 1, function(x) dist2(as.matrix(dtm[x,]), method = rwmd, norm = 'none'))
Error in method$dist2(x, y) :
inherits(x, "sparseMatrix") && inherits(y, "sparseMatrix") is not TRUE
Called from: method$dist2(x, y)
Method 2:
> apply(matching, 1, function(x) dist2((dtm[x,]), method = rwmd, norm = 'none'))
|====================================================================================================| 100%
|====================================================================================================| 100%
|====================================================================================================| 100%
|====================================================================================================| 100%
|====================================================================================================| 100%
|====================================================================================================| 100%
|====================================================================================================| 100%
[,1] [,2] [,3]
[1,] -0.00000000000000001804112 -0.00000000000000001518568 -0.00000000000000003168025
[2,] 0.06690147056044426499000 0.03183972474513259431905 0.02992448660488894462972
[3,] 0.06690147056044426499000 0.03183972474513259431905 0.02992448660488894462972
[4,] -0.00000000000000002283564 -0.00000000000000001232901 -0.00000000000000003952019
[,4] [,5] [,6]
[1,] -0.00000000000000001162810 -0.000000000000000009077403 -0.00000000000000003039822
[2,] 0.07794911930538156452641 0.016792819916915013161995 0.08875270114006890420644
[3,] 0.07794911930538156452641 0.016792819916915013161995 0.08875270114006890420644
[4,] -0.00000000000000001939834 -0.000000000000000009394918 -0.00000000000000004965902
[,7]
[1,] -0.00000000000000001829033
[2,] 0.03438092421044294105803
[3,] 0.03438092421044294105803
[4,] -0.00000000000000001748001
(Which gives some of the correct results from the diagonal but also some additional results)
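One possible reading of that output, offered as a sketch rather than a confirmed explanation: each call returns a 2 x 2 matrix, and apply() flattens it column by column into a length-4 vector. Rows 1 and 4 would then be each document's distance to itself, and rows 2 and 3 the distance between the two documents of the pair. The helper `toy_dist` and the toy `pairs` matrix below are made up for illustration and do not use rwmd.

```r
# Hypothetical stand-in for a distance call that returns a 2 x 2 matrix
# for one pair; the off-diagonal value plays the role of the pair's distance.
toy_dist <- function(pair) {
  d <- abs(pair[1] - pair[2])
  matrix(c(0, d, d, 0), nrow = 2)
}

# Three toy pairs, one per row: (1, 2), (5, 9), (7, 8).
pairs <- rbind(c(1, 2), c(5, 9), c(7, 8))

# apply() simplifies each 2 x 2 result into one length-4 column.
flat <- apply(pairs, 1, toy_dist)

# Row 2 (identical to row 3 here) holds one value per pair.
pair_values <- flat[2, ]
```

If this reading is right, taking row 2 of the Method 2 result would give the 7 diagonal values directly.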
This will loop through each row of your matching matrix and execute the line that you said works:
apply(matching, 1, function(x) dist2(as.matrix(dtm[x,]), method = 'cosine', norm = 'none'))
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] -2 1 1 -1 1 1 1 0
[2,] 1 1 1 1 1 1 1 1
[3,] 1 1 1 1 1 1 1 1
[4,] 1 1 0 -1 -1 0 -3 1
Or, if you want to keep the row and column names, you can skip the as.matrix conversion:
res<-apply(matching, 1, function(x) dist2((dtm[x,]), method = 'cosine', norm = 'none'))
res
[[1]]
2 x 2 Matrix of class "dgeMatrix"
1108058_10-K_2005 1108058_10-K_2006
1108058_10-K_2005 -2 1
1108058_10-K_2006 1 1
[[2]]
2 x 2 Matrix of class "dgeMatrix"
64040_10-K_2005 64040_10-K_2006
64040_10-K_2005 1 1
64040_10-K_2006 1 1
#6 more list items...
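If only the distance between the two documents of each pair is needed, the off-diagonal entry of each list element can be pulled out. A minimal sketch, using a made-up list `res_toy` of plain 2 x 2 matrices holding the same numbers as the first two results above. The same `[1, 2]` indexing should also work on the Matrix objects, though that is not tested here.

```r
# Toy copy of the first two list elements shown above, as base R matrices.
res_toy <- list(
  matrix(c(-2, 1, 1, 1), nrow = 2),
  matrix(c(1, 1, 1, 1), nrow = 2)
)

# Keep only the [1, 2] entry (first document vs. second document) per pair.
pair_dist <- vapply(res_toy, function(d) d[1, 2], numeric(1))
```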
And if you don't like working with lists, you can convert your list to an array:
library(abind)
abind::abind(lapply(res, as.matrix), along = 3)
, , 1
63276_10-K_2005 63276_10-K_2006
63276_10-K_2005 -2 1
63276_10-K_2006 1 1
, , 2
63276_10-K_2005 63276_10-K_2006
63276_10-K_2005 1 1
63276_10-K_2006 1 1
#6 more matrix slices...
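A side note on the negative entries: they look consistent with 1 minus the raw dot product of the two rows, which is what a cosine distance reduces to when the rows are not scaled to unit length (norm = 'none'). This is an inference from the numbers, not something checked against the text2vec source. The sketch below rebuilds the question's dtm as a dense base R matrix and reproduces the values; `dense`, `self_dist` and `pair_dist` are illustrative names.

```r
# 0-based row index of the single non-zero entry in each of the 20 columns,
# copied from the `i` slot of the dgCMatrix in the question.
i <- c(3, 16, 3, 12, 7, 3, 14, 0, 17, 12, 8, 8, 0, 16, 3, 9, 9, 0, 16, 16)
dense <- matrix(0, nrow = 19, ncol = 20)
dense[cbind(i + 1, seq_along(i))] <- 1

matching <- cbind(c(1, 5, 7, 9, 11, 14, 16, 18),
                  c(2, 6, 8, 10, 13, 15, 17, 19))

a <- dense[matching[, 1], , drop = FALSE]  # first document of each pair
b <- dense[matching[, 2], , drop = FALSE]  # second document of each pair

self_dist <- 1 - rowSums(a * a)  # first document against itself
pair_dist <- 1 - rowSums(a * b)  # one value per matching row
```

Here `self_dist` matches the first row of the apply() output above (-2 1 1 -1 1 1 1 0), and `pair_dist` is 1 for every pair because no two documents in this sample share a term.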
Separately, your attempt at an apply statement tried to pass two variables, x and y. apply() only passes one variable, the row vector. Instead, you have to subset:
apply(matching, 1, function(x) sum(x[1],x[2]))
[1] 3 11 15 19 24 29 33 37
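Putting the pieces together, the loop from the question could be wrapped in a function along these lines. This is a sketch under assumptions: `pairwise_dist`, `dist_fun` and `dot_dist` are names introduced here, and the stand-in distance is plain base R so the example runs without text2vec. With text2vec loaded, one would presumably pass something like function(x, y) dist2(x, y, method = "cosine", norm = "none")[1, 1] as `dist_fun` instead.

```r
# Compute one distance per row of `pairs`, never the full cross matrix.
# `dist_fun` must take two one-row matrices and return a single number.
pairwise_dist <- function(m, pairs, dist_fun) {
  vapply(seq_len(nrow(pairs)), function(k) {
    x <- m[pairs[k, 1], , drop = FALSE]
    y <- m[pairs[k, 2], , drop = FALSE]
    dist_fun(x, y)
  }, numeric(1))
}

# Stand-in distance (an assumption, not text2vec): 1 minus the dot product.
dot_dist <- function(x, y) (1 - tcrossprod(x, y))[1, 1]

# Toy data: four documents, three terms.
m <- rbind(c(1, 0, 1),
           c(1, 0, 0),
           c(0, 1, 0),
           c(0, 1, 1))
pairs <- rbind(c(1, 2), c(1, 3), c(3, 4))

out <- pairwise_dist(m, pairs, dot_dist)
```

Using vapply() rather than apply() keeps the result a plain numeric vector with one entry per row of the pairs matrix, and it fails loudly if `dist_fun` returns anything other than a single number.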