I have a vector of objects (object) along with a corresponding vector of time frames (tframe) in which the objects were observed. For each unique pair of objects, I want to calculate the number of time frames in which both objects were observed.
I can write the code using for() loops, but it takes a long time to run as the number of unique objects increases. How might I change the code to speed up the run time?
Below is an example with 4 unique objects (in reality I have about 300). For example, objects a and c were both observed in time frames 1 and 2, so they get a count of 2. Objects b and d were never observed in the same time frame, so they get a count of 0.
object <- c("a", "a", "a", "b", "b", "c", "c", "c", "c", "d")
tframe <- c(1, 1, 2, 2, 3, 1, 2, 2, 3, 1)

uo <- unique(object)
n <- length(uo)
mpairs <- matrix(NA, nrow=n*(n-1)/2, ncol=3,
  dimnames=list(NULL, c("obj1", "obj2", "sametf")))
row <- 0
for(i in 1:(n-1)) {
  for(j in (i+1):n) {
    row <- row+1
    mpairs[row, "obj1"] <- uo[i]
    mpairs[row, "obj2"] <- uo[j]
    # no. of time frames in which both objects in a pair were observed
    intwin <- intersect(tframe[object==uo[i]], tframe[object==uo[j]])
    mpairs[row, "sametf"] <- length(intwin)
  }
}
data.frame(object, tframe)
   object tframe
1       a      1
2       a      1
3       a      2
4       b      2
5       b      3
6       c      1
7       c      2
8       c      2
9       c      3
10      d      1

mpairs
     obj1 obj2 sametf
[1,] "a"  "b"  "1"
[2,] "a"  "c"  "2"
[3,] "a"  "d"  "1"
[4,] "b"  "c"  "2"
[5,] "b"  "d"  "0"
[6,] "c"  "d"  "1"
You can use tcrossprod() to get the counts of agreement. You can then reshape the data, if required.
Example
object <- c("a", "a", "a", "b", "b", "c", "c", "c", "c", "d")
tframe <- c(1, 1, 2, 2, 3, 1, 2, 2, 3, 1)
# This will give you the counts
# Use code from Jean's comment
tab <- tcrossprod(table(object, tframe)>0)
# Reshape the data
tab[lower.tri(tab, TRUE)] <- NA
reshape2::melt(tab, na.rm=TRUE)
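If you would rather not depend on reshape2, the same idea can be sketched in base R. This is only an illustration under my own assumptions: the names present, shared and pairs are mine, not from the question, and the final sort is there just to reproduce the row order of mpairs.

```r
object <- c("a", "a", "a", "b", "b", "c", "c", "c", "c", "d")
tframe <- c(1, 1, 2, 2, 3, 1, 2, 2, 3, 1)

# Presence/absence matrix: one row per object, one column per time frame.
# Comparing with 0 collapses repeated observations in the same time frame.
present <- table(object, tframe) > 0

# Entry [i, j] is the number of time frames in which both
# object i and object j were observed.
shared <- tcrossprod(present)

# Keep each unordered pair once (strict upper triangle).
idx <- which(upper.tri(shared), arr.ind = TRUE)
pairs <- data.frame(
  obj1   = rownames(shared)[idx[, 1]],
  obj2   = colnames(shared)[idx[, 2]],
  sametf = shared[idx],
  stringsAsFactors = FALSE
)

# Sort so the rows follow the same order as the loop-based mpairs.
pairs <- pairs[order(pairs$obj1, pairs$obj2), ]
pairs
```

Unlike mpairs, which is a character matrix because object names and counts share one matrix, pairs keeps sametf numeric. The pairwise loop is replaced by a single matrix product, which is where the speed-up comes from.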