I have a matrix that contains 3 columns and in total 10,000 elements. First and second columns are indexes and third column is the score. I want to normalize the score column based on this formula:
Normalized_score_i_j = score_i_j / ((sqrt(score_i_i) * (sqrt(score_j_j))
score_i_j
= the current score itself
score_i_i
= look at current score's index in first column, and in the dataset look for a score that has that index in both its first and second columns
score_j_j
= look at current score's index in second column, and in the dataset look for a score that has that index in both its first and second columns
An example is for instance, if df is as follow:
df <- read.table(text = "
First.Protein,Second.Protein,Score
1,1,25
1,2,90
1,3,82
1,4,19
2,1,90
2,2,99
2,3,76
2,4,79
3,1,82
3,2,76
3,3,91
3,4,33
4,1,28
4,2,11
4,3,99
4,4,50
", header = TRUE, sep = ",")
If we are normalizing this row:
First.Protein Second.Protein Score
4 3 99
The normalized score will be:
The score itself divided by the sqrt of a score that its First.Protein and Second.Protein index are both 4 multiplied by the sqrt of a score where its First.Protein and Second.Protein indexes are both 3.
Therefore:
Normalized = 99 / (sqrt(50) * sqrt(91)) = 1.467674
I have the code below, but it is behaving very weirdly and is giving me values that are not at all normalized and are in fact very odd:
for(i in 1:nrow(Smith_Waterman_Scores))
{
Smith_Waterman_Scores$Score[i] <-
Smith_Waterman_Scores$Score[i] /
(sqrt(Smith_Waterman_Scores$Score[which(Smith_Waterman_Scores$First.Protein==Smith_Waterman_Scores$First.Protein[i] & Smith_Waterman_Scores$Second.Protein==Smith_Waterman_Scores$First.Protein[i])])) *
(sqrt(Smith_Waterman_Scores$Score[which(Smith_Waterman_Scores$First.Protein==Smith_Waterman_Scores$Second.Protein[i] & Smith_Waterman_Scores$Second.Protein==Smith_Waterman_Scores$Second.Protein[i])]))
}
Here's a re-write of your original attempt (which()
is not necessary; just use the logical vector for sub-setting; with()
allows you to refer to variables in the data frame without having to re-type the name of the data.frame -- easier to read but also easier to make a mistake)
orig0 <- function(df) {
for(i in 1:nrow(df)) {
df$Score[i] <- with(df, {
ii <- First.Protein == First.Protein[i] &
Second.Protein == First.Protein[i]
jj <- First.Protein == Second.Protein[i] &
Second.Protein == Second.Protein[i]
Score[i] / (sqrt(Score[ii]) * sqrt(Score[jj]))
})
}
df$Score
}
The problem is that Score[ii]
and Score[jj]
appear on the right-hand side both before and after they've been updated. Here's a revision where the original columns are interpreted as 'read-only'
orig1 <- function(df) {
normalized <- numeric(nrow(df)) # pre-allocate
for(i in 1:nrow(df)) {
normalized[i] <- with(df, {
ii <- First.Protein == First.Protein[i] &
Second.Protein == First.Protein[i]
jj <- First.Protein == Second.Protein[i] &
Second.Protein == Second.Protein[i]
Score[i] / (sqrt(Score[ii]) * sqrt(Score[jj]))
})
}
normalized
}
I think the results are now correct (see below). A better implementation would use sapply (or vapply) to avoid having to worry about the allocation of the return value
orig2 <- function(df) {
sapply(seq_len(nrow(df)), function(i) {
with(df, {
ii <- First.Protein == First.Protein[i] &
Second.Protein == First.Protein[i]
jj <- First.Protein == Second.Protein[i] &
Second.Protein == Second.Protein[i]
Score[i] / (sqrt(Score[ii]) * sqrt(Score[jj]))
})
})
}
Now that the results are correct, we can ask about performance. Your solution requires a scan of, e.g., First.Protein, each time through the loop. There are N=nrow(df) elements of First.Protein, and you're going through the loop N times, so you'll be making a multiple of N * N = N^2 comparisons -- if you increase the size of the data frame from 10 to 100 rows, the time taken will change from 10 * 10 = 100 units, to 100 * 100 = 10000 units of time.
Several of the answers attempt to avoid that polynomial scaling. My answer does this using match()
on a vector of values; this probably scales as N (each look-up occurs in constant time, and there are N look-ups), which is much better than polynomial.
Create a subset of data with identical first and second proteins
ii = df[df$First.Protein == df$Second.Protein,]
Here's the ijth score from the original data frame
s_ij = df$Score
Look up First.Protein of df
in ii
and record the score; likewise for Second.Protein
s_ii = ii[match(df$First.Protein, ii$First.Protein), "Score"]
s_jj = ii[match(df$Second.Protein, ii$Second.Protein), "Score"]
The normalized scores are then
> s_ij / (sqrt(s_ii) * sqrt(s_jj))
[1] 1.0000000 1.8090681 1.7191871 0.5374012 1.8090681 1.0000000 0.8007101
[8] 1.1228571 1.7191871 0.8007101 1.0000000 0.4892245 0.7919596 0.1563472
[15] 1.4676736 1.0000000
This will be fast, using a single call to match()
instead of many calls to which()
inside a for loop or tests for identity inside an apply()
-- both of the latter make N^2 comparisons and so scale very poorly.
I summarized some of the proposed solutions as
f0 <- function(df) {
contingency = xtabs(Score ~ ., df)
diagonals <- unname(diag(contingency))
i <- df$First.Protein
j <- df$Second.Protein
idx <- matrix(c(i, j), ncol=2)
contingency[idx] / (sqrt(diagonals[i]) * sqrt(diagonals[j]))
}
f1 <- function(df) {
ii = df[df$First.Protein == df$Second.Protein,]
s_ij = df$Score
s_ii = ii[match(df$First.Protein, ii$First.Protein), "Score"]
s_jj = ii[match(df$Second.Protein, ii$Second.Protein), "Score"]
s_ij / (sqrt(s_ii) * sqrt(s_jj))
}
f2 <- function(dt) {
dt.lookup <- dt[First.Protein == Second.Protein]
setkey(dt,"First.Protein" )
setkey(dt.lookup,"First.Protein" )
colnames(dt.lookup) <- c("First.Protein","Second.Protein","Score1")
dt <- dt[dt.lookup]
setkey(dt,"Second.Protein" )
setkey(dt.lookup,"Second.Protein")
colnames(dt.lookup) <- c("First.Protein","Second.Protein","Score2")
dt[dt.lookup][
, Normalized := Score / (sqrt(Score1) * sqrt(Score2))][
, .(First.Protein, Second.Protein, Normalized)]
}
f3 <- function(dt) {
eq = dt[First.Protein == Second.Protein]
dt[eq, Score_ii := i.Score, on = "First.Protein"]
dt[eq, Score_jj := i.Score, on = "Second.Protein"]
dt[, Normalised := Score/sqrt(Score_ii * Score_jj)]
dt[, c("Score_ii", "Score_jj") := NULL]
}
I know how to programmatically check that the first two generate consistent results; I don't know data.table well enough to get the normalized result out in the same order as the input columns for f2() so can't compare with the others (though they look correct 'by eye'). f3()
produces numerically similar but not identical results
> identical(orig1(df), f0(df))
[1] TRUE
> identical(f0(df), f1(df))
[1] TRUE
> identical(f0(df), { f3(dt3); dt3[["Normalized"]] }) # pass by reference!
[1] FALSE
> all.equal(f0(df), { f3(dt3); dt3[["Normalized"]] })
[1] TRUE
There are performance differences
library(data.table)
dt2 <- as.data.table(df)
dt3 <- as.data.table(df)
library(microbenchmark)
microbenchmark(f0(df), f1(df), f2(dt2), f3(dt3))
with
> microbenchmark(f0(df), f1(df), f2(df), f3(df))
Unit: microseconds
expr min lq mean median uq max neval
f0(df) 967.117 992.8365 1059.7076 1030.9710 1094.247 2384.360 100
f1(df) 176.238 192.8610 210.4059 207.8865 219.687 333.260 100
f2(df) 4884.922 4947.6650 5156.0985 5017.1785 5142.498 6785.975 100
f3(df) 3281.185 3329.4440 3463.8073 3366.3825 3443.400 5144.430 100
The solutions f0 - f3 are likely to scale well (especially data.table) with real data; the fact that the times are in microseconds probably means that speed is not important (now that we are not implementing an N^2 algorithm).
On reflection, a more straight-forward impelementation of f1()
just looks up the 'diagonal' elements
f1a <- function(df) {
ii = df[df$First.Protein == df$Second.Protein, ]
d = sqrt(ii$Score[order(ii$First.Protein)])
df$Score / (d[df$First.Protein] * d[df$Second.Protein])
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With