 

Read binary data into R efficiently

Tags: dataframe, r

From a text file I'm reading in binary data structured like this:

0101010100101010101010101010
1010101001010101010101010111
1111101010101010100101010101

The file has 800 lines. Every line is equally long (but the length varies between files, so it doesn't make sense to hard-code it). I want the input stored in a data frame, with every line as a row and each digit in its own column, for example:

col1 col2 col3 col4
0      1    0    1

Currently I am doing it like this:

g <- as.matrix(read.table(text = gsub("", " ", readLines("input"))))

However, that takes too long, as there are roughly 70,000 0/1 digits in each line.

Is there a quicker way to do this?

asked Dec 06 '22 by heinheo

1 Answer

You could pipe the file through awk:

read.table(pipe("awk '{gsub(/./,\"& \", $1);print $1}' yourfile.txt"))
#   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21
#1  0  1  0  1  0  1  0  1  0   0   1   0   1   0   1   0   1   0   1   0   1
#2  1  0  1  0  1  0  1  0  0   1   0   1   0   1   0   1   0   1   0   1   0
#3  1  1  1  1  1  0  1  0  1   0   1   0   1   0   1   0   1   0   0   1   0
#  V22 V23 V24 V25 V26 V27 V28
#1   0   1   0   1   0   1   0
#2   1   0   1   0   1   1   1
#3   1   0   1   0   1   0   1

Or, equivalently:

read.table(pipe("awk '{gsub(\"\",\" \", $1);print $1}' yourfile.txt"))
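To see what the awk substitution is doing (not part of the original answer, just an illustration): `gsub(/./, "& ")` matches every single character and replaces it with itself (`&`) followed by a space. The same idea can be reproduced in R with a capture group:

```r
# Insert a space after every character, mirroring awk's gsub(/./, "& ")
x <- "0101"
gsub("(.)", "\\1 ", x)
#> "0 1 0 1 "
```

The trailing space is harmless here, since `read.table` ignores trailing whitespace when splitting fields.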

fread can also be combined with awk:

library(data.table)
fread("awk '{gsub(/./,\"&,\", $1);print $1}' yourfile.txt")

Benchmarking on a dataset similar in size to the OP's (800 lines of 70,000 digits each):

library(stringi)
write.table(stri_rand_strings(800,70000, '[0-1]'), file='binary1.txt',
         row.names=FALSE, quote=FALSE, col.names=FALSE)

system.time(fread("awk '{gsub(/./,\"&,\", $1);print $1}' binary1.txt"))
#  user  system elapsed 
#16.444   0.108  16.542 
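For comparison (this is not from the original answer), a dependency-free base-R sketch that avoids awk entirely: split each line into single characters with `strsplit` and fill a matrix row by row. It assumes, as above, that every line has the same length and contains only 0/1 characters.

```r
lines <- readLines("binary1.txt")

# strsplit with split = "" breaks each line into individual characters;
# unlist flattens them, and byrow = TRUE keeps one input line per row
m <- matrix(as.integer(unlist(strsplit(lines, "", fixed = TRUE))),
            nrow = length(lines), byrow = TRUE)

g <- as.data.frame(m)
```

Whether this beats the awk pipe will depend on the platform; the awk/fread route avoids materialising the full character vector in R.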
answered Dec 27 '22 by akrun