From a text file I'm reading in binary data structured like this:
0101010100101010101010101010
1010101001010101010101010111
1111101010101010100101010101
The file has 800 lines. Every line is equally long (but the length varies between files, so it doesn't make sense to hard-code it). I want the input stored in a data frame in which every line is a row and each digit goes into its own column, for example:
col1 col2 col3 col4
0 1 0 1
Currently I am doing it like this:
as.matrix(read.table(text=gsub("", ' ', readLines("input"))))->g
However, that takes too long, as there are roughly 70,000 0/1 values in each line.
Is there a quicker way to do this?
You could pipe with awk. The gsub call below appends a space after every character, so read.table can split each line on whitespace:
read.table(pipe("awk '{gsub(/./,\"& \", $1);print $1}' yourfile.txt"))
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21
#1 0 1 0 1 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1 0 1
#2 1 0 1 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1 0 1 0
#3 1 1 1 1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 0 1 0
# V22 V23 V24 V25 V26 V27 V28
#1 0 1 0 1 0 1 0
#2 1 0 1 0 1 1 1
#3 1 0 1 0 1 0 1
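If the result is ultimately needed as a matrix, as in the as.matrix() call in the question, the same pipe can be wrapped directly. A minimal sketch, reusing the yourfile.txt name from above:
# same awk pipe as above, coerced to a 0/1 integer matrix
g <- as.matrix(read.table(pipe("awk '{gsub(/./,\"& \", $1);print $1}' yourfile.txt")))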
Or, using an empty-string pattern in the awk gsub:
read.table(pipe("awk '{gsub(\"\",\" \", $1);print $1}' yourfile.txt"))
fread from data.table can also be combined with awk; here gsub appends a comma after every character, so fread reads each line as comma-separated values:
library(data.table)
fread("awk '{gsub(/./,\"&,\", $1);print $1}' yourfile.txt")
Using a dataset similar to the OP's:
library(stringi)
write.table(stri_rand_strings(800,70000, '[0-1]'), file='binary1.txt',
row.names=FALSE, quote=FALSE, col.names=FALSE)
system.time(fread("awk '{gsub(/./,\"&,\", $1);print $1}' binary1.txt"))
# user system elapsed
#16.444 0.108 16.542
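For comparison, a pure-R route that avoids the awk dependency is to split every line into single characters with strsplit and stack the pieces into a matrix. This is only a sketch, assuming the file fits in memory and every line has the same length (as stated in the question); it has not been benchmarked against the awk pipelines above.
# read all lines, split each into single characters,
# and fill an integer matrix row by row
lines <- readLines("binary1.txt")
m <- matrix(as.integer(unlist(strsplit(lines, ""))),
            nrow = length(lines), byrow = TRUE)
g <- as.data.frame(m)   # data frame with one column per digit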