
Read all files in a folder and apply a function to each data frame

I have put a relatively simple piece of analysis into a function, which I run on all the files in a particular folder. I was wondering whether anyone had any tips to help me automate the process across a number of different folders.

  1. Firstly, I was wondering whether there was a way of reading all the files in a particular folder straight into R. I believe the following command will list all the files:

files <- (Sys.glob("*.csv"))

...which I found in "Using R to list all files with a specified extension".

And then the following code reads all those files into R.

listOfFiles <- lapply(files, function(x) read.table(x, header = FALSE))  

…from "Manipulating multiple files in R".

But the files seem to be read in as one continuous list and not individual files… how can I change the script to open all the csv files in a particular folder as individual dataframes?

  2. Secondly, assuming that I can read all the files in separately, how do I apply a function to all these dataframes in one go? For example, I have created four small dataframes so I can illustrate what I want:

     Df.1 <- data.frame(A = c(5, 4, 7, 6, 8, 4),  B = c(1, 5, 2, 4, 9, 1))
     Df.2 <- data.frame(A = 1:6,                  B = c(2, 3, 4, 5, 1, 1))
     Df.3 <- data.frame(A = c(4, 6, 8, 0, 1, 11), B = c(7, 6, 5, 9, 1, 15))
     Df.4 <- data.frame(A = c(4, 2, 6, 8, 1, 0),  B = c(3, 1, 9, 11, 2, 16))

I have also made up an example function:

Summary <- function(dfile) {
  # Statistics for column A
  sumA    <- sum(dfile$A)
  MinA    <- min(dfile$A)
  MeanA   <- mean(dfile$A)
  MedianA <- median(dfile$A)
  MaxA    <- max(dfile$A)

  # Statistics for column B
  sumB    <- sum(dfile$B)
  MinB    <- min(dfile$B)
  MeanB   <- mean(dfile$B)
  MedianB <- median(dfile$B)
  MaxB    <- max(dfile$B)

  # Combine the per-column values into one vector per statistic
  Sum    <- c(sumA, sumB)
  Min    <- c(MinA, MinB)
  Mean   <- c(MeanA, MeanB)
  Median <- c(MedianA, MedianB)
  Max    <- c(MaxA, MaxB)
  rm(sumA, sumB, MinA, MinB, MeanA, MeanB, MedianA, MedianB, MaxA, MaxB)

  # One row per column of the input
  Label <- c("A", "B")
  dfile_summary <- data.frame(Label, Sum, Min, Mean, Median, Max)
  return(dfile_summary)
}

I would ordinarily use the following command to apply the function to each individual dataframe.

Df1.summary<-Summary(dfile)

Is there a way instead to apply the function to all the dataframes at once, and to use the names of the dataframes in the names of the summary tables (e.g. Df1.summary)?

Many thanks,

Katie

KT_1 asked Mar 05 '12 09:03



1 Answer

On the contrary, I do think working with a list makes it easy to automate such things.

Here is one solution (I stored your four dataframes in folder temp/).

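filenames <- list.files("temp", pattern = "\\.csv$", full.names = TRUE)  # pattern is a regular expression
ldf <- lapply(filenames, read.csv)       # one data frame per file
res <- lapply(ldf, summary)              # one summary table per data frame
names(res) <- substr(filenames, 6, 30)   # drop the leading "temp/"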

It is important to store the full path for your files (as I did with full.names); otherwise you have to paste the directory name onto each file name yourself, e.g.

filenames <- list.files("temp", pattern = "\\.csv$")
filenames <- paste("temp", filenames, sep = "/")

will work too. Note that I used substr to extract the file names while discarding the full path.
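
A minimal self-contained sketch of the same steps, assuming a scratch directory created with tempfile() stands in for temp/ (the directory and the file names df1.csv and df2.csv are made up for illustration). It uses basename() rather than substr() to strip the directory, so it does not depend on the length of the path:

```r
# Hypothetical scratch folder standing in for temp/
dir_path <- tempfile("csvdemo")
dir.create(dir_path)
write.csv(data.frame(A = c(5, 4, 7), B = c(1, 5, 2)),
          file.path(dir_path, "df1.csv"), row.names = FALSE)
write.csv(data.frame(A = 1:3, B = c(2, 3, 4)),
          file.path(dir_path, "df2.csv"), row.names = FALSE)

# 'pattern' is a regular expression, so escape the dot and anchor the end
filenames <- list.files(dir_path, pattern = "\\.csv$", full.names = TRUE)

# One data frame per file, kept together in a list
ldf <- lapply(filenames, read.csv)

# Name each element after its file, without the directory part
names(ldf) <- basename(filenames)

str(ldf$df1.csv)
```

Each file is still an individual data frame; the list is only a container, and elements can be pulled out by name (ldf$df1.csv) or by position (ldf[[1]]).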

You can access your summary tables as follows:

> res$`df4.csv`
       A              B
 Min.   :0.00   Min.   : 1.00
 1st Qu.:1.25   1st Qu.: 2.25
 Median :3.00   Median : 6.00
 Mean   :3.50   Mean   : 7.00
 3rd Qu.:5.50   3rd Qu.:10.50
 Max.   :8.00   Max.   :16.00
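
lapply() accepts any function that takes a data frame, not only the built-in summary(). A sketch, assuming a small hypothetical helper col_stats() in place of the Summary function from the question, and two in-memory data frames in place of the files read from disk:

```r
# Two of the example data frames from the question, stored in a named list
ldf <- list(
  Df.1 = data.frame(A = c(5, 4, 7, 6, 8, 4), B = c(1, 5, 2, 4, 9, 1)),
  Df.2 = data.frame(A = 1:6,                 B = c(2, 3, 4, 5, 1, 1))
)

# Hypothetical helper: one row of statistics per column of the input
col_stats <- function(df) {
  data.frame(
    Label  = names(df),
    Sum    = sapply(df, sum),
    Min    = sapply(df, min),
    Mean   = sapply(df, mean),
    Median = sapply(df, median),
    Max    = sapply(df, max),
    row.names = NULL
  )
}

# Apply it to every data frame; the result keeps the names of the list
res <- lapply(ldf, col_stats)
res$Df.1
```

Because the input list is named, the summaries can be looked up under the same names as the data frames they came from.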

If you really want to get individual summary tables, you can extract them afterwards. E.g.,

for (i in 1:length(res))
  assign(paste(paste("df", i, sep=""), "summary", sep="."), res[[i]])
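
If separate objects are needed, list2env() can do the same job as the loop once the list carries the desired names. A sketch, assuming res is a list of summary tables like the one above (small stand-in data frames are used here so the block runs on its own):

```r
# Stand-in for the list of summaries built earlier
res <- list(data.frame(x = 1), data.frame(x = 2))

# Build the object names df1.summary, df2.summary, ...
names(res) <- paste0("df", seq_along(res), ".summary")

# Create one variable per list element in the current environment
invisible(list2env(res, envir = environment()))

df2.summary
```

Keeping the results in the list is usually easier to work with, since further lapply() calls can then process them all at once.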
chl answered Oct 14 '22 17:10