
Using rbind() to combine multiple data frames into one larger data.frame within lapply()

I'm using RStudio 0.99.491 and R version 3.2.3 (2015-12-10). I'm a relative newbie to R, and I'd appreciate some help. I'm working on a project where I'm trying to use the server logs on an old media server to identify which folders/files on the server are still being accessed and which aren't, so that my team knows which files to migrate. Each log covers a 24-hour period, and I have approximately a year's worth of logs, so in theory I should be able to see all of the access over the past year.

My ideal output is a tree structure or plot that shows me the folders on our server that are being used. I've figured out how to read one log (one day) into R as a data.frame and then use the data.tree package to turn that into a tree. Now I want to go through all of the log files in the directory, one by one, and add them to that original data.frame before I create the tree. Here's my current code:

#Create the list of log files in the folder
files <- list.files(pattern = "*.log", full.names = TRUE, recursive = FALSE)
#Create a new data.frame to hold the aggregated log data
uridata <- data.frame()
#My function to go through each file, one by one, and add it to the 'uridata' df, above
lapply(files, function(x){
    uriraw <- read.table(x, skip = 3, header = TRUE, stringsAsFactors = FALSE)
    #print(nrow(uriraw))
    uridata <- rbind(uridata, uriraw)
    #print(nrow(uridata))
})

The problem is that, no matter what I try, the value of 'uridata' within the lapply loop seems to not be saved/passed outside of the lapply loop, but is somehow being overwritten each time the loop runs. So instead of getting one big data.frame, I just get the contents of the last 'uriraw' file. (That's why there are those two commented print commands inside the loop; I was testing how many lines there were in the data frames each time the loop ran.)

Can anyone clarify what I'm doing wrong? Again, I'd like one big data.frame at the end that combines the contents of each of the (currently seven) log files in the folder.

John Lynch asked Nov 29 '22 23:11


2 Answers

do.call() is your friend.

big.list.of.data.frames <- lapply(files, function(x){
    read.table(x, skip = 3, header = TRUE, stringsAsFactors = FALSE)
})

Or, more concisely (but less tinkerable):

big.list.of.data.frames <- lapply(files, read.table,
                                  skip = 3, header = TRUE,
                                  stringsAsFactors = FALSE)

Then:

big.data.frame <- do.call(rbind, big.list.of.data.frames)
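To try the two steps above without the real logs, here is a minimal self-contained sketch. The folder name, file names, and file contents are made up for illustration; the only thing borrowed from the question is the layout (three lines to skip, then a header row). Note that the `pattern` argument of list.files() is a regular expression, so the sketch uses "\\.log$" to match names ending in ".log".

```r
# Hypothetical stand-in for the real logs: three tiny files in a temp folder.
# Each file has 3 preamble lines (to match skip = 3), a header row, and 2 data rows.
log_dir <- file.path(tempdir(), "demo_logs")
dir.create(log_dir, showWarnings = FALSE)
for (i in 1:3) {
  writeLines(
    c("preamble line 1", "preamble line 2", "preamble line 3",
      "date uri",
      paste0("2015-12-0", i, " /media/a"),
      paste0("2015-12-0", i, " /media/b")),
    file.path(log_dir, paste0("day", i, ".log"))
  )
}

# List the log files; pattern is a regular expression, not a shell glob.
files <- list.files(log_dir, pattern = "\\.log$", full.names = TRUE)

# Step 1: read every file into its own data frame, collected in a list.
frames <- lapply(files, read.table,
                 skip = 3, header = TRUE, stringsAsFactors = FALSE)

# Step 2: bind all of the data frames together in a single call.
combined <- do.call(rbind, frames)

nrow(combined)   # 3 files x 2 rows each
names(combined)
```

Because the list keeps one data frame per file, it is easy to inspect a single day (for example frames[[1]]) before binding everything.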

This is the recommended approach because "growing" a data frame dynamically in R is painful: it is slow and memory-expensive, since a new data frame gets built at each iteration.
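Separately from the performance point, the loop in the question cannot update 'uridata' at all: an assignment with `<-` inside a function creates a local variable that disappears when the function returns, so the global 'uridata' is never touched. A small sketch of that behaviour, using two made-up data frames in place of parsed log files:

```r
# Two small data frames standing in for two parsed log files (made-up values).
pieces <- list(
  data.frame(uri = c("/a", "/b"), stringsAsFactors = FALSE),
  data.frame(uri = "/c", stringsAsFactors = FALSE)
)

uridata <- data.frame()

# Same shape as the loop in the question: `<-` inside the function binds a
# local uridata, which is discarded when each call returns.
returned <- lapply(pieces, function(x) {
  uridata <- rbind(uridata, x)
})

nrow(uridata)            # the global data frame is still empty
sapply(returned, nrow)   # each returned element holds just one input piece

# Collecting the return values and binding once gives the combined frame.
combined <- do.call(rbind, returned)
nrow(combined)
```

The values are not lost, they are simply returned by lapply() as a list, which is why capturing that list and calling do.call(rbind, ...) on it works.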

Jason answered Dec 04 '22 02:12


You can use map_df from the purrr package instead of lapply to get all of the results combined directly into one data frame.

library(purrr)
map_df(files, read.table, skip = 3, header = TRUE, stringsAsFactors = FALSE)

Ricky answered Dec 04 '22 01:12