Append new data to an existing dataframe (RDS) in R

Tags: stream, dataframe, append, r

I have an Rscript that is reading in a constant stream of data in the form of a flat file. Another script picks up this flat file, does some parsing and processing, then saves the result as a data.frame in RDS format. It then sleeps, and repeats the process.

saveRDS(tmp.df, file="H:/Documents/tweet.df.rds") #saving the data.frame

On the second... nth iteration, I have the code only process the new lines added to the flat file since the previous iteration. However, in order to append the delta lines to the permanent data frame, I have to read it in, append, and then save it back out, overwriting the original.

df2 <- readRDS("H:/Documents/tweet.df.rds") #read in permanent
tmp.df2 <- rbind(df2, tmp.df) #append new to existing
saveRDS(tmp.df2, file="H:/Documents/tweet.df.rds") #save it
rm(df2) #housecleaning
rm(tmp.df2) #housecleaning

This approach is risky because whenever the RDS file is open for reading or writing, any other process that wants to touch it has to wait. As the base file gets bigger, each read and write takes longer, so the risk increases.

Is there something like an appendRDS (I know there literally isn't) that can achieve what I want: iteratively updating a single data frame saved to a file, by appending rather than by complete replacement?


asked Dec 28 '12 21:12

Amw 5G


2 Answers

I think you can safeguard your process by using connections: open one, do the read or write, and close it again before the next process takes over.

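con <- gzfile("tmp.rds", "rb")   # open in binary mode for reading
df <- readRDS(con)
close(con)                       # release the file before writing
df.new <- rbind(df, df)          # placeholder: append your new rows here
con <- gzfile("tmp.rds", "wb")   # reopen in binary mode for writing
saveRDS(df.new, con)
close(con)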

Update:

You can test if a connection to the file is open and tell it to wait for a bit if you're having problems with concurrency.

while(isOpen(con)) { # untested but something of this nature should work
  Sys.sleep(2)
}
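Note that isOpen() only reports on a connection object inside the current R session, so it cannot see whether a different process has the file open. One possible way to make two scripts wait for each other is a lock directory, sketched below in base R. The helper name with_dir_lock and its arguments are hypothetical, and the sketch assumes that creating a directory is atomic on the filesystem in use, which may not hold on some network shares.

```r
# Sketch: cross-process waiting via a lock directory (base R only).
# with_dir_lock, lock_dir, wait and max_tries are illustrative names.
with_dir_lock <- function(lock_dir, expr, wait = 2, max_tries = 30) {
  tries <- 0
  # dir.create() returns FALSE when the directory already exists,
  # i.e. when another process currently holds the lock
  while (!dir.create(lock_dir, showWarnings = FALSE)) {
    tries <- tries + 1
    if (tries >= max_tries) stop("timed out waiting for lock: ", lock_dir)
    Sys.sleep(wait)
  }
  # Release the lock when this function exits, even on error
  on.exit(unlink(lock_dir, recursive = TRUE))
  force(expr)  # expr is only evaluated here, after the lock is held
}

# Usage: wrap the read/append/save step in the lock
lock_dir <- tempfile("tweet_lock_")
value <- with_dir_lock(lock_dir, { 1 + 1 })
```

Both the writer and any reader would need to wrap their file access in the same lock for this to help.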


answered Oct 03 '22 14:10

Brandon Bertelsen

Is there anything wrong with using a series of numbered RDS files in a directory instead of a single RDS file? I don't think it is possible to append to a data frame in an RDS file without rewriting the entire file. Data frames are simply lists of columns, so presumably they are serialized one column at a time, which means only the last column ends near the end of the file and new rows would have to be inserted throughout the file.
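A minimal sketch of the numbered-files idea, assuming a single writer process. The names chunk_dir, write_chunk and read_all_chunks are hypothetical, and the example data stands in for the parsed tweets.

```r
# Sketch: store each batch of new rows as its own numbered RDS file.
chunk_dir <- tempfile("tweet_chunks_")
dir.create(chunk_dir)

write_chunk <- function(df, dir) {
  # Next index = number of existing chunk files + 1 (assumes one writer)
  n <- length(list.files(dir, pattern = "^chunk_[0-9]+\\.rds$"))
  path <- file.path(dir, sprintf("chunk_%06d.rds", n + 1L))
  saveRDS(df, path)
  invisible(path)
}

read_all_chunks <- function(dir) {
  # Zero-padded names sort in the order they were written
  files <- sort(list.files(dir, pattern = "^chunk_[0-9]+\\.rds$",
                           full.names = TRUE))
  do.call(rbind, lapply(files, readRDS))
}

# Usage: two iterations, each writing only its new rows
write_chunk(data.frame(id = 1:2, text = c("a", "b")), chunk_dir)
write_chunk(data.frame(id = 3L, text = "c"), chunk_dir)
all_rows <- read_all_chunks(chunk_dir)
```

Each iteration then writes only a small file, so the time spent writing no longer grows with the total amount of data. A reader could still see a chunk that is only partly written, so this reduces the risk window rather than removing it.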

If you want to stick with a single file but minimize the risk of reading inconsistent data from an RDS file, you can read it in, do the append operation, write the result out to a temp file, and rename the temp file to the original name once the write has finished. Then at least your period of risk does not depend on the size of the file. I'm not familiar with what kind of atomicity various filesystems guarantee when renaming a file to an existing name, but the rename is probably much quicker than the time saveRDS takes to write the file.
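The write-then-rename approach could look roughly like this. The helper name append_rds is hypothetical, and whether the rename is atomic (or can overwrite an existing file at all) depends on the operating system and filesystem, so treat this as a sketch under that assumption.

```r
# Sketch: write the updated data frame to a temp file, then rename it
# over the original so readers never see a half-written file.
append_rds <- function(new_rows, path) {
  old <- if (file.exists(path)) readRDS(path) else NULL
  combined <- rbind(old, new_rows)
  # Keep the temp file in the same directory so the rename
  # does not have to cross filesystems
  tmp <- tempfile(pattern = "rds_tmp_", tmpdir = dirname(path),
                  fileext = ".rds")
  saveRDS(combined, tmp)
  if (!file.rename(tmp, path)) {
    unlink(tmp)
    stop("could not rename ", tmp, " to ", path)
  }
  invisible(combined)
}

# Usage: the first call creates the file, later calls append to it
rds_path <- tempfile(fileext = ".rds")
append_rds(data.frame(id = 1:2), rds_path)
append_rds(data.frame(id = 3L), rds_path)
result <- readRDS(rds_path)
```

The full read and rewrite still happens on every iteration; only the moment at which the target file changes is shortened.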

answered Oct 03 '22 13:10

Ryan C. Thompson
