Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R - Import & Merge Multiple Excel Files And Add Filesource Variable

I have used R for various things over the past year but due to the number of packages and functions available, I am still sadly a beginner. I believe R would allow me to do what I want to do with minimal code, but I am struggling.

What I want to do:

I have roughly a hundred different excel files containing data on students. Each excel file represents a different school but contains the same variables. I need to:

  1. Import the data into R from Excel
  2. Add a variable to each file containing the filename
  3. Merge all of the data (add observations/rows - do not need to match on variables)

I will need to do this for multiple sets of data, so I am trying to make this as simple and easy to replicate as possible.

What the Data Look Like:

Row 1    Title
Row 2    StudentID    Var1    Var2            Var3      Var4        Var5
Row 3    11234           1         9/8/2011      343      159-167      32
Row 4    11235           2         9/16/2011    112      152-160      12
Row 5    11236           1         9/8/2011      325      164-171      44

Row 1 is meaningless and Row 2 contains the variable names. The files have different numbers of rows.

What I have so far:

At first I simply tried to import data from excel. Using the XLSX package, this works nicely:

dat <- read.xlsx2("FILENAME.xlsx", sheetIndex=1, 
                  sheetName=NULL, startRow=2, 
                  endRow=NULL, as.data.frame=TRUE, 
                  header=TRUE)

Next, I focused on figuring out how to merge the files (also thought this is where I should add the filename variable to the datafiles). This is where I got stuck.

setwd("FILE_PATH_TO_EXCEL_DIRECTORY")
filenames <- list.files(pattern=".xls")
do.call("rbind", lapply(filenames, read.xlsx2, sheetIndex=1, colIndex=6, header=TRUE, startrow=2, FILENAMEVAR=filenames));

I set my directory, make a list of all the excel file names in the folder, and then try to merge them in one statement using the a variable for the filenames.

When I do this I get the following error:

Error in data.frame(res, ...) : arguments imply differing number of rows: 616, 1, 5

I know there is a problem with my application of lapply - the startrow is not being recognized as an option and the FILENAMEVAR is trying to merge the list of 5 sample filenames as opposed to adding a column containing the filename.

What next?

If anyone can refer me to a useful resource or function, critique what I have so far, or point me in a new direction, it would be GREATLY appreciated!

like image 714
ESmith5988 Avatar asked Oct 07 '22 03:10

ESmith5988


1 Answers

I'll post my comment (with bdemerast picking up on the typo). The solution was untested as xlsx will not run happily on my machine

You need to pass a single FILENAMEVAR to read.xlsx2.

lapply(filenames, function(x) read.xlsx2(file=x, sheetIndex=1, colIndex=6, header=TRUE, startRow=2, FILENAMEVAR=x))
like image 185
mnel Avatar answered Oct 10 '22 02:10

mnel