I have a large, sparse binary matrix (roughly 39,000 x 14,000; most rows have only a single "1" entry). I'd like to cluster similar rows together, but my initial plan takes too long to complete:
d <- dist(inputMatrix, method="binary")
hc <- hclust(d, method="complete")
The first step doesn't finish, so I'm not sure how the second step would fare. What are some approaches to efficiently grouping similar rows of a large, sparse, binary matrix in R?
I've written some Rcpp code and R code which works out the binary/Jaccard distance of a binary matrix approx. 80x faster than dist(x, method = "binary")
. It converts the input matrix into a raw matrix which is the transpose of the input (so that the bit patterns are in the correct order internally). This is then used in some C++ code which handles the data as 64 bit unsigned integers for speed. The Jaccard distance of two vectors x and y is equal to x ^ y / (x | y)
where ^
is the xor operator. The Hamming Weight calculation is used to count the number of bits set if the result of the xor
or or
is non-zero.
I've put together the code on github at https://github.com/NikNakk/binaryDist/ and reproduced the two files below. I've confirmed that the results are the same as dist(x, method = "binary")
for a few random datasets.
On a dataset of 39000 rows by 14000 columns with 1-5 ones per row, it took about 11 minutes. The output distance matrix was 5.7 GB.
#include <Rcpp.h>
using namespace Rcpp;
//countBits function taken from https://en.wikipedia.org/wiki/Hamming_weight#Efficient_implementation
const uint64_t m1 = 0x5555555555555555; //binary: 0101...
const uint64_t m2 = 0x3333333333333333; //binary: 00110011..
const uint64_t m4 = 0x0f0f0f0f0f0f0f0f; //binary: 4 zeros, 4 ones ...
const uint64_t h01 = 0x0101010101010101; //the sum of 256 to the power of 0,1,2,3...
int countBits(uint64_t x) {
x -= (x >> 1) & m1; //put count of each 2 bits into those 2 bits
x = (x & m2) + ((x >> 2) & m2); //put count of each 4 bits into those 4 bits
x = (x + (x >> 4)) & m4; //put count of each 8 bits into those 8 bits
return (x * h01)>>56; //returns left 8 bits of x + (x<<8) + (x<<16) + (x<<24) + ...
}
// [[Rcpp::export]]
int countBitsFromRaw(RawVector rv) {
uint64_t* x = (uint64_t*)RAW(rv);
return(countBits(*x));
}
// [[Rcpp::export]]
NumericVector bDist(RawMatrix mat) {
int nr(mat.nrow()), nc(mat.ncol());
int nw = nr / 8;
NumericVector res(nc * (nc - 1) / 2);
// Access the raw data as unsigned 64 bit integers
uint64_t* data = (uint64_t*)RAW(mat);
uint64_t a(0);
// Work through each possible combination of columns (rows in the original integer matrix)
for (int i = 0; i < nc - 1; i++) {
for (int j = i + 1; j < nc; j++) {
uint64_t sx = 0;
uint64_t so = 0;
// Work through each 64 bit integer and calculate the sum of (x ^ y) and (x | y)
for (int k = 0; k < nw; k++) {
uint64_t o = data[nw * i + k] | data[nw * j + k];
// If (x | y == 0) then (x ^ y) will also be 0
if (o) {
// Use Hamming weight method to calculate number of set bits
so = so + countBits(o);
uint64_t x = data[nw * i + k] ^ data[nw * j + k];
if (x) {
sx = sx + countBits(x);
}
}
}
res(a++) = (double)sx / so;
}
}
return (res);
}
library("Rcpp")
library("plyr")
sourceCpp("bDist.cpp")
# Converts a binary integer vector into a packed raw vector,
# padding out at the end to make the input length a multiple of packWidth
packRow <- function(row, packWidth = 64L) {
packBits(as.raw(c(row, rep(0, (packWidth - length(row)) %% packWidth))))
}
as.PackedMatrix <- function(x, packWidth = 64L) {
UseMethod("as.PackedMatrix")
}
# Converts a binary integer matrix into a packed raw matrix
# padding out at the end to make the input length a multiple of packWidth
as.PackedMatrix.matrix <- function(x, packWidth = 64L) {
stopifnot(packWidth %% 8 == 0, class(x) %in% c("matrix", "Matrix"))
storage.mode(x) <- "raw"
if (ncol(x) %% packWidth != 0) {
x <- cbind(x, matrix(0L, nrow = nrow(x), ncol = packWidth - (ncol(x) %% packWidth)))
}
out <- packBits(t(x))
dim(out) <- c(ncol(x) %/% 8, nrow(x))
class(out) <- "PackedMatrix"
out
}
# Converts back to an integer matrix
as.matrix.PackedMatrix <- function(x) {
out <- rawToBits(x)
dim(out) <- c(nrow(x) * 8L, ncol(x))
storage.mode(out) <- "integer"
t(out)
}
# Generates random sparse data for testing the main function
makeRandomData <- function(nObs, nVariables, maxBits, packed = FALSE) {
x <- replicate(nObs, {
y <- integer(nVariables)
y[sample(nVariables, sample(maxBits, 1))] <- 1L
if (packed) {
packRow(y, 64L)
} else {
y
}
})
if (packed) {
class(x) <- "PackedMatrix"
x
} else {
t(x)
}
}
# Reads a binary matrix from file or character vector
# Borrows the first bit of code from read.table
readPackedMatrix <- function(file = NULL, text = NULL, packWidth = 64L) {
if (missing(file) && !missing(text)) {
file <- textConnection(text)
on.exit(close(file))
}
if (is.character(file)) {
file <- file(file, "rt")
on.exit(close(file))
}
if (!inherits(file, "connection"))
stop("'file' must be a character string or connection")
if (!isOpen(file, "rt")) {
open(file, "rt")
on.exit(close(file))
}
lst <- list()
i <- 1
while(length(line <- readLines(file, n = 1)) > 0) {
lst[[i]] <- packRow(as.integer(strsplit(line, "", fixed = TRUE)[[1]]), packWidth = packWidth)
i <- i + 1
}
out <- do.call("cbind", lst)
class(out) <- "PackedMatrix"
out
}
# Wrapper for the C++ code which
binaryDist <- function(x) {
if (class(x) != "PackedMatrix") {
x <- as.PackedMatrix(x)
}
dst <- bDist(x)
attr(dst, "Size") <- ncol(x)
attr(dst, "Diag") <- attr(dst, "Upper") <- FALSE
attr(dst, "method") <- "binary"
attr(dst, "call") <- match.call()
class(dst) <- "dist"
dst
}
x <- makeRandomData(2000, 400, maxBits = 5, packed = TRUE)
system.time(bd <- binaryDist(x))
From original answer:
Other things to consider would be doing some prefiltering of comparisons between two rows with single ones since the distance will either be 0 for duplicates or 1 for any other possibility.
A couple of relatively straightforward options that might be faster without needing much code are the vegdist
function from the vegan package and the Dist
function from the amap package. The latter will probably only be quicker if you have multiple cores and take advantage of the fact it supports parallelisation.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With