Suppose I have a gigantic m*n matrix X (that is too big to read into memory) and a binary numeric vector V of length m. My objective is to read only the rows of X for which V[i] == 1 (and not those for which V[i] == 0) into a dedicated data table/matrix, through a package such as (but not necessarily limited to) bigmemory or ff.
This can be done by hacking nrows and skip and so on in read.table, but I'm looking for a bigmemory, ff et al. type solution due to insufficient RAM.
Here's an MWE that does not reflect the true size of my X.
X <- array(rnorm(100*5),dim=c(100,5))
write.csv(X,"target.csv")
V <- sample(c(rep(1,50),rep(0,50))) #Only want to read in half the rows corresponding to 1's
rm(X)
#Now ... How to read "target.csv"?
How about using the command line tool sed, constructing a command that lists the line numbers you want to read? I am not sure whether there is some command length limit on this...
# Check the data
head( X )
# [,1] [,2] [,3] [,4] [,5]
#[1,] 0.2588798 0.42229528 0.4469073 1.0684309 1.35519389
#[2,] 1.0267562 0.80299223 -0.2768111 -0.7017247 -0.06575137
#[3,] 1.0110365 -0.36998260 -0.8543176 1.6237827 -1.33320291
#[4,] 1.5968757 2.13831188 0.6978655 -0.5697239 -1.53799156
#[5,] 0.1284392 0.55596342 0.6919573 0.6558735 -1.69494827
#[6,] -0.2406540 -0.04807381 -1.1265165 -0.9917737 0.31186670
# Check V, note row 6 above should be skipped according to this....
head(V)
# [1] 1 1 1 1 1 0
# Get line numbers we want to read
head( which( V == 1 ) )
# [1] 1 2 3 4 5 7
# Read the first 6 rows where V == 1 (the header is line 1 of the file, hence the leading 1 and the +1 added to 'which()')
read.csv( pipe( paste0( "sed -n '" , paste0( c( 1 , which( V == 1 )[1:6] + 1 ) , collapse = "p; " ) , "p' C:/Data/target.csv" ) ) , header = TRUE )
# X V1 V2 V3 V4 V5
#1 1 0.2588798 0.4222953 0.4469073 1.0684309 1.35519389
#2 2 1.0267562 0.8029922 -0.2768111 -0.7017247 -0.06575137
#3 3 1.0110365 -0.3699826 -0.8543176 1.6237827 -1.33320291
#4 4 1.5968757 2.1383119 0.6978655 -0.5697239 -1.53799156
#5 5 0.1284392 0.5559634 0.6919573 0.6558735 -1.69494827
#6 7 0.6856038 0.1082029 0.1523561 -1.4147429 -0.64041290
The command we are actually passing to sed is...
"sed -n '1p; 2p; 3p; 4p; 5p; 6p; 8p' C:/Data/target.csv"
We use -n to turn off the default printing of every line, then give a semicolon-separated list of the line numbers we do want to print, taken from which( V == 1 ), and finally the target filename. Remember that these line numbers have been offset by +1 to account for the header row, which is line 1 of the file.
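If the command length limit mentioned above turns out to be a problem, one possible workaround is to write the line numbers to a sed script file and pass it with -f, so the command itself stays short. This is only a sketch: read_rows_sed is a hypothetical helper name, it assumes sed is on the PATH, and it assumes the CSV has exactly one header line. It also uses integer line numbers, because pasting a double such as 100000 would produce "1e+05", which sed cannot parse.

```r
# Hypothetical helper (not part of any package): read only the rows of a CSV
# whose entry in 'keep' equals 1, using a sed script file instead of a long
# command line. Assumes 'sed' is available and the file has one header line.
read_rows_sed <- function(csv_path, keep, header = TRUE) {
  rows <- which(keep == 1)
  # File line numbers as integers: the header is line 1, so data row i is line i + 1.
  lines <- if (header) c(1L, rows + 1L) else rows
  # One "<n>p" command per line of the script file.
  script <- tempfile(fileext = ".sed")
  on.exit(unlink(script))
  writeLines(paste0(lines, "p"), script)
  # -n suppresses default printing; -f reads the commands from the script file.
  cmd <- paste("sed -n -f", shQuote(script), shQuote(csv_path))
  read.csv(pipe(cmd), header = header)
}
```

With the example above this would be called as read_rows_sed("target.csv", V). Note that sed still scans the whole file; only the selected lines reach R.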
ffdfindexget from the ff package is what you are looking for:
Function ffdfindexget allows to extract rows from an ffdf data.frame according to positive integer subscripts stored in an ff vector.
So in your example:
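library(ff)
write.csv(X, "target.csv")
d <- read.csv.ffdf(file = "target.csv")   # disk-backed ffdf, not held in RAM
i <- ff(which(V == 1))                    # row indices stored as an ff vector
di <- ffdfindexget(d, i)                  # ffdf containing only the selected rows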