 

The which function in R is not giving the desired output

I have a matrix with 3 columns and 10,000 elements in total. The first and second columns are indexes and the third column is the score. I want to normalize the score column based on this formula:

Normalized_score_i_j = score_i_j / (sqrt(score_i_i) * sqrt(score_j_j))

score_i_j = the current score itself

score_i_i = the score of the row whose first and second columns both equal the current row's first-column index

score_j_j = the score of the row whose first and second columns both equal the current row's second-column index

For example, if df is as follows:

df <- read.table(text = "
First.Protein,Second.Protein,Score
1,1,25
1,2,90
1,3,82
1,4,19
2,1,90
2,2,99
2,3,76
2,4,79
3,1,82
3,2,76
3,3,91
3,4,33
4,1,28
4,2,11
4,3,99
4,4,50
", header = TRUE, sep = ",")

If we are normalizing this row:

First.Protein Second.Protein Score
4             3              99

The normalized score will be:

The score itself, divided by the product of the sqrt of the score whose First.Protein and Second.Protein are both 4, and the sqrt of the score whose First.Protein and Second.Protein are both 3.

Therefore:

Normalized =  99 / (sqrt(50) * sqrt(91)) = 1.467674
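
This single-row calculation can be checked directly at the R prompt:

> 99 / (sqrt(50) * sqrt(91))
[1] 1.467674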

I have the code below, but it is behaving very weirdly and is giving me values that are not at all normalized and are in fact very odd:

for(i in 1:nrow(Smith_Waterman_Scores))
{
  Smith_Waterman_Scores$Score[i] <-
    Smith_Waterman_Scores$Score[i] /
    (sqrt(Smith_Waterman_Scores$Score[which(
        Smith_Waterman_Scores$First.Protein == Smith_Waterman_Scores$First.Protein[i] &
        Smith_Waterman_Scores$Second.Protein == Smith_Waterman_Scores$First.Protein[i])])) *
    (sqrt(Smith_Waterman_Scores$Score[which(
        Smith_Waterman_Scores$First.Protein == Smith_Waterman_Scores$Second.Protein[i] &
        Smith_Waterman_Scores$Second.Protein == Smith_Waterman_Scores$Second.Protein[i])]))
}
asked by DoeNoe


1 Answer

Here's a re-write of your original attempt. which() is not necessary; just use the logical vector for subsetting. with() lets you refer to variables in the data frame without re-typing the name of the data frame (easier to read, but also easier to make a mistake):

orig0 <- function(df) {
    for(i in 1:nrow(df)) {
        df$Score[i] <- with(df, {
            ii <- First.Protein == First.Protein[i] &
                Second.Protein == First.Protein[i]
            jj <- First.Protein == Second.Protein[i] &
                Second.Protein == Second.Protein[i]
            Score[i] / (sqrt(Score[ii]) * sqrt(Score[jj]))
        })
    }
    df$Score
}

The problem is that Score[ii] and Score[jj] appear on the right-hand side both before and after they've been updated: for example, the very first iteration overwrites the (1, 1) score (25 becomes 25 / (sqrt(25) * sqrt(25)) = 1), so every later row that needs that diagonal entry divides by sqrt(1) instead of sqrt(25). Here's a revision where the original columns are treated as 'read-only':

orig1 <- function(df) {
    normalized <- numeric(nrow(df))     # pre-allocate
    for(i in 1:nrow(df)) {
        normalized[i] <- with(df, {
            ii <- First.Protein == First.Protein[i] &
                Second.Protein == First.Protein[i]
            jj <- First.Protein == Second.Protein[i] &
                Second.Protein == Second.Protein[i]
            Score[i] / (sqrt(Score[ii]) * sqrt(Score[jj]))
        })
    }
    normalized
}

I think the results are now correct (see below). A better implementation would use sapply (or vapply) to avoid having to worry about the allocation of the return value

orig2 <- function(df) {
    sapply(seq_len(nrow(df)), function(i) {
        with(df, {
            ii <- First.Protein == First.Protein[i] &
                Second.Protein == First.Protein[i]
            jj <- First.Protein == Second.Protein[i] &
                Second.Protein == Second.Protein[i]
            Score[i] / (sqrt(Score[ii]) * sqrt(Score[jj]))
        })
    })
}

Now that the results are correct, we can ask about performance. Your solution requires a scan of, e.g., First.Protein, each time through the loop. There are N = nrow(df) elements of First.Protein and you go through the loop N times, so you make on the order of N * N = N^2 comparisons: if you increase the size of the data frame from 10 to 100 rows, the time taken grows from 10 * 10 = 100 units to 100 * 100 = 10,000 units.
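
To see that growth concretely, here is a small timing sketch (not part of the original answer); make_df() is a made-up helper that builds an all-against-all score table for n proteins, so the data frame has n * n rows:

set.seed(123)
make_df <- function(n) {
    ## every i-j pair, including the i == i 'diagonal' rows the normalization needs
    d <- expand.grid(First.Protein = seq_len(n), Second.Protein = seq_len(n))
    d$Score <- sample(100, nrow(d), replace = TRUE)
    d
}

system.time(orig2(make_df(25)))   # 625 rows
system.time(orig2(make_df(50)))   # 2,500 rows: 4x the rows, expect roughly 16x the time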

Several of the answers attempt to avoid that quadratic scaling. My answer does this using match() on a vector of values; this probably scales as N (each look-up occurs in roughly constant time, and there are N look-ups), which is much better than quadratic.
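
As a quick reminder of what match() does (an illustration added here, not from the original answer), it returns the position of each value's first match in a lookup vector:

> match(c(4, 3, 4), c(1, 2, 3, 4))
[1] 4 3 4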

Create a subset of data with identical first and second proteins

ii = df[df$First.Protein == df$Second.Protein,]

Here's the ijth score from the original data frame

s_ij = df$Score

Look up First.Protein of df in ii and record the score; likewise for Second.Protein

s_ii = ii[match(df$First.Protein, ii$First.Protein), "Score"]
s_jj = ii[match(df$Second.Protein, ii$Second.Protein), "Score"]

The normalized scores are then

> s_ij / (sqrt(s_ii) * sqrt(s_jj))
 [1] 1.0000000 1.8090681 1.7191871 0.5374012 1.8090681 1.0000000 0.8007101
 [8] 1.1228571 1.7191871 0.8007101 1.0000000 0.4892245 0.7919596 0.1563472
[15] 1.4676736 1.0000000
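
As a spot-check (added here, not in the original answer), the 15th row of df is the (4, 3) pair worked through by hand above, and the 15th normalized value agrees with that calculation:

> (s_ij / (sqrt(s_ii) * sqrt(s_jj)))[15]
[1] 1.467674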

This will be fast: a single call to match() replaces many calls to which() inside a for loop, or tests for identity inside an apply(); both of the latter make N^2 comparisons and so scale very poorly.

I summarized some of the proposed solutions as follows:

f0 <- function(df) {
    ## cross-tabulate Score by the two protein columns: a protein x protein
    ## matrix whose diagonal holds the i,i scores
    contingency = xtabs(Score ~ ., df)
    diagonals <- unname(diag(contingency))
    i <- df$First.Protein
    j <- df$Second.Protein
    ## indexing with a two-column matrix extracts each (i, j) score in row order
    idx <- matrix(c(i, j), ncol=2)
    contingency[idx] / (sqrt(diagonals[i]) * sqrt(diagonals[j]))
}

f1 <- function(df) {
    ii = df[df$First.Protein == df$Second.Protein,]
    s_ij = df$Score
    s_ii = ii[match(df$First.Protein, ii$First.Protein), "Score"]
    s_jj = ii[match(df$Second.Protein, ii$Second.Protein), "Score"]
    s_ij / (sqrt(s_ii) * sqrt(s_jj))
}

f2 <- function(dt) {
    dt.lookup <- dt[First.Protein == Second.Protein]
    setkey(dt,"First.Protein" )
    setkey(dt.lookup,"First.Protein" )
    colnames(dt.lookup) <- c("First.Protein","Second.Protein","Score1")
    dt <- dt[dt.lookup]
    setkey(dt,"Second.Protein" )
    setkey(dt.lookup,"Second.Protein")
    colnames(dt.lookup) <- c("First.Protein","Second.Protein","Score2")
    dt[dt.lookup][
      , Normalized :=  Score / (sqrt(Score1) * sqrt(Score2))][
      , .(First.Protein, Second.Protein, Normalized)]
}

f3 <- function(dt) {
    eq = dt[First.Protein == Second.Protein]
    dt[eq, Score_ii := i.Score, on = "First.Protein"]
    dt[eq, Score_jj := i.Score, on = "Second.Protein"]
    dt[, Normalized := Score/sqrt(Score_ii * Score_jj)]
    dt[, c("Score_ii", "Score_jj") := NULL]
}

I know how to programmatically check that the first two generate consistent results. I don't know data.table well enough to get the normalized result out of f2() in the same order as the input rows, so I can't compare it with the others (though the results look correct 'by eye'; one possible reordering check is sketched after the comparisons below). f3() produces numerically similar, but not identical, results:

> identical(orig1(df), f0(df))
[1] TRUE
> identical(f0(df), f1(df))
[1] TRUE
> identical(f0(df), { f3(dt3); dt3[["Normalized"]] })  # pass by reference!
[1] FALSE
> all.equal(f0(df), { f3(dt3); dt3[["Normalized"]] })
[1] TRUE
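
One way to bring f2()'s result back into the input row order, so it can be compared like the others, is to match on the two protein columns. This is a sketch added here (not part of the original answer); it uses the dt2 copy created for the benchmarks below and assumes f2() returns columns named First.Protein, Second.Protein and Normalized, as selected in its last line:

res2 <- f2(dt2)
idx <- match(paste(df$First.Protein, df$Second.Protein),
             paste(res2$First.Protein, res2$Second.Protein))
all.equal(f0(df), res2$Normalized[idx])   # expect TRUE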

There are performance differences

library(data.table)    
dt2 <- as.data.table(df)
dt3 <- as.data.table(df)

library(microbenchmark)
microbenchmark(f0(df), f1(df), f2(dt2), f3(dt3))

with

> microbenchmark(f0(df), f1(df), f2(dt2), f3(dt3))
Unit: microseconds
   expr      min        lq      mean    median       uq      max neval
 f0(df)  967.117  992.8365 1059.7076 1030.9710 1094.247 2384.360   100
 f1(df)  176.238  192.8610  210.4059  207.8865  219.687  333.260   100
 f2(dt2) 4884.922 4947.6650 5156.0985 5017.1785 5142.498 6785.975   100
 f3(dt3) 3281.185 3329.4440 3463.8073 3366.3825 3443.400 5144.430   100

The solutions f0 - f3 are likely to scale well to real data (especially the data.table versions); the fact that the times are in microseconds probably means that speed is no longer important (now that we are not implementing an N^2 algorithm).
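
As a rough confirmation (again not part of the original answer), re-running two of the solutions on a larger input built with the make_df() helper sketched earlier is still quick:

big <- make_df(100)                          # 10,000 rows
microbenchmark(f0(big), f1(big), times = 10)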

On reflection, a more straightforward implementation of f1() just looks up the 'diagonal' elements:

f1a <- function(df) {
    ii = df[df$First.Protein == df$Second.Protein, ]
    ## d[k] is the square root of the k,k ('diagonal') score
    d = sqrt(ii$Score[order(ii$First.Protein)])
    df$Score / (d[df$First.Protein] * d[df$Second.Protein])
}
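
A quick check (added here, not in the original answer) that the simplified version agrees with f1():

all.equal(f1(df), f1a(df))   # expect TRUE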
answered by Martin Morgan