Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

applying a dist function rowise in custom function

Tags:

r

This is a follow on question to an earlier post I made here - I think I made significant progress and now the question has changed.

I have a "matching" matrix which looks like the following:

    [,1] [,2]
[1,]    1    2
[2,]    5    6
[3,]    7    8
[4,]    9   10
[5,]   11   13
[6,]   14   15
[7,]   16   17
[8,]   18   19

I also have a dtm - document term matrix:

1108058_10-K_2005  . . . . . . . 1 . . . . 1 . . . . 1 . .
1108058_10-K_2006  . . . . . . . . . . . . . . . . . . . .
72243_10-K_2005    . . . . . . . . . . . . . . . . . . . .
1352341_10-K_2006  1 . 1 . . 1 . . . . . . . . 1 . . . . .
64040_10-K_2005    . . . . . . . . . . . . . . . . . . . .
64040_10-K_2006    . . . . . . . . . . . . . . . . . . . .
1111247_10-K_2005  . . . . . . . . . . . . . . . . . . . .
1111247_10-K_2006  . . . . 1 . . . . . . . . . . . . . . .
1129425_10-K_2005  . . . . . . . . . . 1 1 . . . . . . . .
1129425_10-K_2006  . . . . . . . . . . . . . . . 1 1 . . .
943894_10-K_2005   . . . . . . . . . . . . . . . . . . . .
943894_10-K/A_2005 . . . . . . . . . . . . . . . . . . . .
943894_10-K_2006   . . . 1 . . . . . 1 . . . . . . . . . .
1176316_10-K_2005  . . . . . . . . . . . . . . . . . . . .
1176316_10-K_2006  . . . . . . 1 . . . . . . . . . . . . .
805305_10-K_2005   . . . . . . . . . . . . . . . . . . . .
805305_10-K_2006   . 1 . . . . . . . . . . . 1 . . . . 1 1
63276_10-K_2005    . . . . . . . . 1 . . . . . . . . . . .
63276_10-K_2006    . . . . . . . . . . . . . . . . . . . .

I can run the following dist function:

dist2(dtm[matching[, 1], ], dtm[matching[, 2], ], method = "cosine", norm = "none")

Which outputs:

WARN [2019-09-11 20:51:40] Sparsity will be lost - worth to calculate similarity instead of distance.
8 x 8 Matrix of class "dgeMatrix"
                  1108058_10-K_2006 64040_10-K_2006 1111247_10-K_2006 1129425_10-K_2006
1108058_10-K_2005                 1               1                 1                 1
64040_10-K_2005                   1               1                 1                 1
1111247_10-K_2005                 1               1                 1                 1
1129425_10-K_2005                 1               1                 1                 1
943894_10-K_2005                  1               1                 1                 1
1176316_10-K_2005                 1               1                 1                 1
805305_10-K_2005                  1               1                 1                 1
63276_10-K_2005                   1               1                 1                 1
                  943894_10-K_2006 1176316_10-K_2006 805305_10-K_2006 63276_10-K_2006
1108058_10-K_2005                1                 1                1               1
64040_10-K_2005                  1                 1                1               1
1111247_10-K_2005                1                 1                1               1
1129425_10-K_2005                1                 1                1               1
943894_10-K_2005                 1                 1                1               1
1176316_10-K_2005                1                 1                1               1
805305_10-K_2005                 1                 1                1               1
63276_10-K_2005                  1                 1                1               1

Which almost does what I want but not quite. It is still calculating "too" many calculations. I want to calculate the dist2 function according to the "rowise" observations in matching. That is calculate dist2 for observation 1 and 2. Then calculate the next dist2 for observation 5 and 6 and then 7 and 8 and so on.

Data:

library(text2vec)

matching <- structure(c(1, 5, 7, 9, 11, 14, 16, 18, 2, 6, 8, 10, 13, 15, 
17, 19), .Dim = c(8L, 2L))


dtm <- new("dgCMatrix", i = c(3L, 16L, 3L, 12L, 7L, 3L, 14L, 0L, 17L, 
12L, 8L, 8L, 0L, 16L, 3L, 9L, 9L, 0L, 16L, 16L), p = 0:20, Dim = 19:20, 
    Dimnames = list(c("1108058_10-K_2005", "1108058_10-K_2006", 
    "72243_10-K_2005", "1352341_10-K_2006", "64040_10-K_2005", 
    "64040_10-K_2006", "1111247_10-K_2005", "1111247_10-K_2006", 
    "1129425_10-K_2005", "1129425_10-K_2006", "943894_10-K_2005", 
    "943894_10-K/A_2005", "943894_10-K_2006", "1176316_10-K_2005", 
    "1176316_10-K_2006", "805305_10-K_2005", "805305_10-K_2006", 
    "63276_10-K_2005", "63276_10-K_2006"), c("counterclaim", 
    "reacting", "dissipating", "delisted", "trades", "relocated", 
    "buyers", "allege", "wind", "antiquated", "initiating", "detract", 
    "instat", "putters", "confronted", "enrolling", "futility", 
    "repatriating", "oppose", "communicates")), x = c(1, 1, 1, 
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), factors = list())

EDIT (my attempts are incorrect): This allows me to apply the dist function on the first row:

  m1 <- as.matrix(dtm[matching[1, ], ])
  dist2(m1, method = "cosine", norm = "none")[1, 2]

Applying it on the second row:

  m1 <- as.matrix(dtm[matching[2, ], ])
  dist2(m1, method = "cosine", norm = "none")

Just need to iterate and create a function to apply it over all rows.

A hacked together some sort of solution (not complete):

for(i in 1:nrow(matching)){
  m <- as.matrix(dtm[matching[i, ], ])
  dist <- dist2(m, method = "cosine", norm = "none")[1, 2]
  print(dist)
}

If anybody can help make this into a function that would be great!

This doesn't give me the correct result

foo <- function(data){
  col1 = data[, 1]
  col2 = data[, 2]
  dist = dist2(dtm[col1, ], dtm[col2, ], method = "cosine", norm = "none")
  return(dist)
}

foo(matching)

or this (does not work):

apply(matching, 1, function(x, y) dist2(dtm[x, ], dtm[y, ], method = "cosine", norm = "norm"))

EDIT:

When I apply the "full" function over the matching data I get a matrix like this: dist2(dtm[matching[, 1], ], dtm[matching[, 2], ], method = rwmd, norm = "none")

(Note: I use a custom method rwmd instead of cosine and I use all the data in the document term matrix - I have also take a new random sample of the data so this data does not match up with the previous data).

                  1019695_10-K_2006 718937_10-K_2006 708955_10-K_2006 923120_10-K_2006 1020569_10-K_2006 862022_10-K_2006
1019695_10-K_2005        0.06690147       0.26848699       0.52009095       0.29421497        0.27183372        0.4673677
718937_10-K_2005         0.21579128       0.03183972       0.44026262       0.26678393        0.24644321        0.4339234
708955_10-K_2005         0.51919906       0.44900795       0.02992449       0.40760294        0.39043990        0.4338723
923120_10-K_2005         0.35596766       0.32048006       0.43839797       0.07794912        0.25703208        0.4123749
1020569_10-K_2005        0.27958200       0.24791561       0.39780292       0.19322863        0.01679282        0.3915167
862022_10-K_2005         0.51707930       0.49270230       0.44924855       0.45008895        0.45454247        0.0887527
917857_10-K_2005         0.30562057       0.27731399       0.41435485       0.22840343        0.22982293        0.4053557
                  917857_10-K_2006
1019695_10-K_2005       0.30368532
718937_10-K_2005        0.25491939
708955_10-K_2005        0.42074617
923120_10-K_2005        0.30625747
1020569_10-K_2005       0.22772452
862022_10-K_2005        0.48192247
917857_10-K_2005        0.03438092

This gets me what I want - but gives too many calculations. That is I am only interested in the diagonal of this matrix where the values are 0.06690147, 0.06690147, 0.02992449 and so on. Which correspond to the points in the matching data here:

     [,1] [,2]
[1,]    1    2
[2,]    3    5
[3,]    7    8
[4,]    9   10
[5,]   12   13
[6,]   15   16
[7,]   18   19

These points correspond to the row locations in the dtm matix.

> dtm[,1:10]
19 x 10 sparse Matrix of class "dgCMatrix"
   [[ suppressing 10 column names ‘reacting’, ‘ments’, ‘proper’ ... ]]

1019695_10-K_2005  . . . . . . . . . .
1019695_10-K_2006  . . . . . . . . 1 1
718937_10-K_2005   . . . . . . . . . .
718937_10-K/A_2005 . . . . . . . . . .
718937_10-K_2006   . . . . . . . . . .
1034258_10-K_2006  . . . 1 . . . . . .
708955_10-K_2005   . . . . . . . . . .
708955_10-K_2006   . . . . . . . . . .
923120_10-K_2005   . . . . . . . . . .
923120_10-K_2006   . . . . . . . . . .
923120_10-K/A_2006 . . . . . . . . . .
1020569_10-K_2005  . . . . . . . . . .
1020569_10-K_2006  1 . . . . . 1 . . .
1009463_10-K_2005  . . . . . 1 . . . .
862022_10-K_2005   . . . . . . . . . .
862022_10-K_2006   . . 1 . . . . . . .
868271_10-K_2005   . 1 . . . . . 1 . .
917857_10-K_2005   . . . . . . . . . .
917857_10-K_2006   . . . . 1 . . . . .

That is I should obtain a result of 7 - which are the diagonal of the dist2 matrix.

EDIT 2:

Applying all your functions gives the following:

Method 1:

> apply(matching, 1, function(x) dist2(as.matrix(dtm[x,]), method = rwmd, norm = 'none'))
Error in method$dist2(x, y) : 
  inherits(x, "sparseMatrix") && inherits(y, "sparseMatrix") is not TRUE
Called from: method$dist2(x, y)

Method 2:

> apply(matching, 1, function(x) dist2((dtm[x,]), method = rwmd, norm = 'none'))
  |====================================================================================================| 100%
  |====================================================================================================| 100%
  |====================================================================================================| 100%
  |====================================================================================================| 100%
  |====================================================================================================| 100%
  |====================================================================================================| 100%
  |====================================================================================================| 100%
                           [,1]                       [,2]                       [,3]
[1,] -0.00000000000000001804112 -0.00000000000000001518568 -0.00000000000000003168025
[2,]  0.06690147056044426499000  0.03183972474513259431905  0.02992448660488894462972
[3,]  0.06690147056044426499000  0.03183972474513259431905  0.02992448660488894462972
[4,] -0.00000000000000002283564 -0.00000000000000001232901 -0.00000000000000003952019
                           [,4]                        [,5]                       [,6]
[1,] -0.00000000000000001162810 -0.000000000000000009077403 -0.00000000000000003039822
[2,]  0.07794911930538156452641  0.016792819916915013161995  0.08875270114006890420644
[3,]  0.07794911930538156452641  0.016792819916915013161995  0.08875270114006890420644
[4,] -0.00000000000000001939834 -0.000000000000000009394918 -0.00000000000000004965902
                           [,7]
[1,] -0.00000000000000001829033
[2,]  0.03438092421044294105803
[3,]  0.03438092421044294105803
[4,] -0.00000000000000001748001

(Which gives some of the correct results from the diagonal but also some additional results)

like image 979
user113156 Avatar asked Nov 07 '22 13:11

user113156


1 Answers

This will loop through each row of your matching matrix and execute the line that you said works:

apply(matching, 1, function(x) dist2(as.matrix(dtm[x,]), method = 'cosine', norm = 'none'))

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,]   -2    1    1   -1    1    1    1    0
[2,]    1    1    1    1    1    1    1    1
[3,]    1    1    1    1    1    1    1    1
[4,]    1    1    0   -1   -1    0   -3    1

Or, if you want to keep the naming conventions, you can skip the conversion of the as.matrix:

res<-apply(matching, 1, function(x) dist2((dtm[x,]), method = 'cosine', norm = 'none'))
res

[[1]]
2 x 2 Matrix of class "dgeMatrix"
                  1108058_10-K_2005 1108058_10-K_2006
1108058_10-K_2005                -2                 1
1108058_10-K_2006                 1                 1

[[2]]
2 x 2 Matrix of class "dgeMatrix"
                64040_10-K_2005 64040_10-K_2006
64040_10-K_2005               1               1
64040_10-K_2006               1               1

#6 more list items...

And if you don't like working with lists, you can convert your list to an array:

library(abind)
abind::abind(lapply(res, as.matrix), along = 3)

, , 1

                63276_10-K_2005 63276_10-K_2006
63276_10-K_2005              -2               1
63276_10-K_2006               1               1

, , 2

                63276_10-K_2005 63276_10-K_2006
63276_10-K_2005               1               1
63276_10-K_2006               1               1

#6 more matrix slices...

Separately, your attempt at an apply statement tried to pass two variables x and y. The apply() only passes 1 variable - the row vector. Instead, you have to subset:

apply(matching, 1, function(x) sum(x[1],x[2]))

[1]  3 11 15 19 24 29 33 37
like image 199
Cole Avatar answered Nov 15 '22 06:11

Cole