
Append new data to an existing dataframe (RDS) in R

I have an Rscript that is reading in a constant stream of data in the form of a flat file. Another script picks up this flat file, does some parsing and processing, then saves the result as a data.frame in RDS format. It then sleeps, and repeats the process.

saveRDS(tmp.df, file="H:/Documents/tweet.df.rds") #saving the data.frame

On the second through nth iterations, I have the code process only the new lines added to the flat file since the previous iteration. However, in order to append the delta lines to the permanent data frame, I have to read it in, append, and then save it back out, overwriting the original.

df2 <- readRDS("H:/Documents/tweet.df.rds")           # read in the permanent data frame
tmp.df2 <- rbind(df2, tmp.df)                         # append new rows to existing
saveRDS(tmp.df2, file = "H:/Documents/tweet.df.rds")  # save it, overwriting the original
rm(df2, tmp.df2)                                      # housecleaning

This approach is risky because whenever the RDS is open for reading/writing, another process wanting to touch that file has to wait. As the base file gets bigger, the risk increases.

Is there something like an appendRDS (I know there literally isn't) that can achieve what I want: iterative updating of a single data frame, saved to a file, that uses appending rather than complete replacement?

Amw 5G asked Dec 28 '12 21:12





2 Answers

I think you can safeguard your process by using connections, opening one for each step and closing it before the next process takes over.

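con <- gzfile("tmp.rds", "rb")  # open for binary reading (RDS files are gzip-compressed by default)
df <- readRDS(con)
close(con)                      # release the file before doing any work
df.new <- rbind(df, df)         # stand-in for appending the new rows, e.g. rbind(df, tmp.df)
con <- gzfile("tmp.rds", "wb")  # reopen for binary writing
saveRDS(df.new, con)
close(con)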

Update:

You can test if a connection to the file is open and tell it to wait for a bit if you're having problems with concurrency.

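# untested, but something of this nature should work
# note: isOpen() reports only on this session's connection object `con`,
# not on whether another process has the file open
while (isOpen(con)) {
  Sys.sleep(2)
}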
Brandon Bertelsen answered Oct 03 '22 14:10


Is there anything wrong with using a series of numbered RDS files in a directory instead of a single RDS file? I don't think it is possible to append to a data frame in an RDS file without rewriting the entire file. Data frames are simply lists of columns, so presumably they are serialized one column at a time, which means only the last column ends near the end of the file.
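A minimal sketch of the numbered-files idea, assuming a single writer process. The helper names write_chunk and read_chunks, the chunk-NNNNNN.rds naming scheme, and the example directory are all hypothetical, chosen only for illustration. Each batch of new rows goes into its own file, and the full data frame is rebuilt by reading every file and row-binding the results.

```r
# Hypothetical helpers: store each batch of new rows as its own numbered RDS file.
# Assumes only one process writes to the directory (numbering is by file count).
chunk_pattern <- "^chunk-[0-9]+\\.rds$"

write_chunk <- function(new_rows, dir) {
  dir.create(dir, showWarnings = FALSE, recursive = TRUE)
  n <- length(list.files(dir, pattern = chunk_pattern)) + 1L
  path <- file.path(dir, sprintf("chunk-%06d.rds", n))  # zero-padded so names sort in order
  saveRDS(new_rows, path)
  invisible(path)
}

read_chunks <- function(dir) {
  files <- sort(list.files(dir, pattern = chunk_pattern, full.names = TRUE))
  do.call(rbind, lapply(files, readRDS))  # row-bind all chunks in file-name order
}

# Example usage with made-up data in a scratch directory
chunk_dir <- file.path(tempdir(), "tweet-chunks")
write_chunk(data.frame(id = 1:2, text = c("a", "b")), chunk_dir)
write_chunk(data.frame(id = 3L, text = "c"), chunk_dir)
all_rows <- read_chunks(chunk_dir)
```

With this layout the writer never touches files that already exist, so the cost of each update depends on the size of the new batch rather than on the size of the accumulated data. A reader could still pick up a chunk that is only partly written, so this narrows the problem rather than eliminating it.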

If you want to stick with a single file but minimize the risk of reading inconsistent data from an RDS file, you can read it in, do the append operation, write the result out to a temp file, and then rename the temp file to the original name once the write has finished. Then at least your period of risk does not depend on the size of the file. I'm not familiar with what kind of atomicity various filesystems guarantee when renaming a file onto an existing name, but the window is probably shorter than the time taken by saveRDS.
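The write-then-rename approach might look roughly like the sketch below. The helper name append_rds and the example path are hypothetical. The temp file is created in the same directory as the target so that the rename stays on one filesystem. On POSIX systems a rename within one filesystem replaces the target atomically; behavior on Windows and on network drives may differ, so treat this as an illustration rather than a guarantee.

```r
# Hypothetical helper: append rows to a data frame stored in an RDS file by
# writing the combined result to a temp file and renaming it over the original.
append_rds <- function(new_rows, path) {
  combined <- if (file.exists(path)) rbind(readRDS(path), new_rows) else new_rows
  # Create the temp file next to the target so the rename does not cross filesystems.
  tmp <- tempfile(pattern = "rds-", tmpdir = dirname(path), fileext = ".tmp")
  saveRDS(combined, tmp)
  if (!file.rename(tmp, path)) {
    unlink(tmp)  # clean up the temp file if the rename failed
    stop("could not rename ", tmp, " to ", path)
  }
  invisible(combined)
}

# Example usage with made-up data in a scratch location
path <- file.path(tempdir(), "tweet.df.rds")
append_rds(data.frame(id = 1:2, text = c("a", "b")), path)
append_rds(data.frame(id = 3L, text = "c"), path)
result <- readRDS(path)
```

The slow part (saveRDS) happens on a file no reader is looking at, and only the rename touches the real path. This does not coordinate two writers; it only shortens the window in which a reader might see a half-written file.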

Ryan C. Thompson answered Oct 03 '22 13:10