
R - Reading STDIN line by line

I want to stream a big data table into R line by line, and if the current line meets a specific condition (let's say the first column is > 15), add the line to a data frame in memory. I have written the following code:

count<-1;
Mydata<-NULL;
fin <- FALSE;
while (!fin){
    if (count==1){
        Myrow=read.delim(pipe('cat /dev/stdin'), header=F,sep="\t",nrows=1);
        Mydata<-rbind(Mydata,Myrow);
        count<-count+1;
    }
    else {
        count<-count+1;
        Myrow=read.delim(pipe('cat /dev/stdin'), header=F,sep="\t",nrows=1);
        if (Myrow!=""){
        if (MyCONDITION){
            Mydata<-rbind(Mydata,Myrow);
        }
        }
        else
        {fin<-TRUE}
    }
}
print(Mydata);

But I get the error "data not available". Please note that my data is big, so I don't want to read it all in at once and then apply my condition (in that case it would be easy).

user1250144 asked Mar 26 '12 11:03


1 Answer

I think it would be wiser to use an R function like readLines. readLines can read just a specified number of lines, e.g. 1. If you open a file connection first and then call readLines repeatedly, you get what you want: each call reads the next n lines from the connection. In R code:

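stop = FALSE
f = file("/tmp/test.txt", "r")
while(!stop) {
  next_line = readLines(f, n = 1)
  if(length(next_line) == 0) {
    ## No more lines: end the loop and close the connection
    stop = TRUE
    close(f)
  } else {
    ## Insert some if statement logic here
    ## (only reached when a line was actually read)
  }
}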
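
As a rough sketch of how the placeholder logic could be filled in for the condition in the question (first column > 15): the helper below reads one line at a time from an already opened connection and keeps only the lines that match. The function name filter_lines, its threshold argument, and the sample data are made up for illustration, and it assumes tab-separated input with a numeric first column.

```r
# Hypothetical helper (not from the original answer): keep only the lines
# whose first tab-separated field is a number greater than `threshold`.
filter_lines <- function(con, threshold = 15) {
  kept <- character(0)
  repeat {
    line <- readLines(con, n = 1)
    if (length(line) == 0) break   # end of input
    if (!nzchar(line)) next        # skip blank lines
    first_field <- strsplit(line, "\t", fixed = TRUE)[[1]][1]
    value <- suppressWarnings(as.numeric(first_field))
    if (!is.na(value) && value > threshold) {
      kept[length(kept) + 1] <- line
    }
  }
  kept
}

# Usage on a small in-memory sample; for real input, pass an opened
# connection (such as `f` in the loop above) instead.
con <- textConnection(c("20\ta", "3\tb", "16\tc"))
kept <- filter_lines(con)
close(con)
print(kept)
```

Taking the connection as an argument keeps the filtering logic independent of where the data comes from, so the same function can be tried on a text connection before pointing it at a file or standard input.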

Additional comments:

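  • R has a built-in way of treating standard input as a connection: stdin(). I suggest you use this instead of pipe('cat /dev/stdin'). This probably makes it more robust, and definitely more cross-platform. Note that, according to R's documentation, stdin() refers to the console; when a script is run non-interactively (for example with Rscript), file("stdin") is the connection that refers to the standard input of the process.
  • You initialize Mydata at the beginning and keep growing it using rbind. As the number of rows you rbind gets larger, this will get really slow: every rbind call builds a new, larger data frame and copies the existing rows into it, so the total work grows much faster than the number of rows. It is better to pre-allocate Mydata, or to collect the rows first (for example with an apply-style loop) and combine them once at the end.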
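
One way to avoid growing the data frame with rbind, sketched under the assumption that the matching lines have already been collected as raw strings: parse them all in a single read.delim call using its text argument. The kept vector below is made-up sample data standing in for the lines that passed the condition.

```r
# Made-up sample lines standing in for the rows that passed the filter.
kept <- c("20\ta", "16\tc", "42\tz")

# Parse all kept lines in one call instead of rbind-ing row by row.
Mydata <- read.delim(text = paste(kept, collapse = "\n"),
                     header = FALSE, sep = "\t",
                     stringsAsFactors = FALSE)
print(Mydata)
```

This builds the data frame once, so the cost of copying rows is paid a single time. An empty kept vector (no matching lines) would need to be handled separately before calling read.delim.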
Paul Hiemstra answered Nov 05 '22 05:11