I am trying to merge several data.frames into one data.frame. Since I have a whole list of files, I am trying to do it with a loop structure.
So far the loop approach works fine. However, it looks pretty inefficient and I am wondering if there is a faster and easier approach.
Here is the scenario: I have a directory with several .csv files. Each file contains the same identifier, which can be used as the merge variable. Since the files are rather large, I thought I would read each file into R one at a time instead of reading all files at once. So I get all the files in the directory with list.files and read in the first two files. Afterwards I use merge to get one data.frame.
FileNames <- list.files(path=".../tempDataFolder/")
FirstFile <- read.csv(file=paste(".../tempDataFolder/", FileNames[1], sep=""), header=TRUE, na.strings="NULL")
SecondFile <- read.csv(file=paste(".../tempDataFolder/", FileNames[2], sep=""), header=TRUE, na.strings="NULL")
dataMerge <- merge(FirstFile, SecondFile, by=c("COUNTRYNAME", "COUNTRYCODE", "Year"), all=TRUE)
Now I use a for loop to get all the remaining .csv files and merge them into the already existing data.frame:
for(i in 3:length(FileNames)){
  ReadInMerge <- read.csv(file=paste(".../tempDataFolder/", FileNames[i], sep=""), header=TRUE, na.strings="NULL")
  dataMerge <- merge(dataMerge, ReadInMerge, by=c("COUNTRYNAME", "COUNTRYCODE", "Year"), all=TRUE)
}
Even though it works just fine I was wondering if there is a more elegant way to get the job done?
You may want to look at the closely related question on stackoverflow.
I would approach this in two steps: import all the data (with plyr), then merge it together:
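filenames <- list.files(path=".../tempDataFolder/", full.names=TRUE)
library(plyr)
# Extra arguments are passed on to read.csv; na.strings="NULL" matches the question's import settings
import.list <- llply(filenames, read.csv, na.strings="NULL")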
That will give you a list of all the files that you now need to merge together. There are many ways to do this, but here's one approach (with Reduce):
data <- Reduce(function(x, y) merge(x, y, all=TRUE, by=c("COUNTRYNAME", "COUNTRYCODE", "Year")),
               import.list, accumulate=FALSE)
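To make the import-then-Reduce idea concrete, here is a minimal self-contained sketch. It swaps plyr's llply for base R's lapply, and everything about the data is made up for illustration: the two tiny CSV files, their names (gdp.csv, pop.csv), the GDP and POP columns, and the temporary folder standing in for the real data directory.

```r
# Hypothetical toy files in a fresh temporary folder (stand-ins for the real CSVs)
tmp <- tempfile("mergeDemo")
dir.create(tmp)
writeLines(c("COUNTRYNAME,COUNTRYCODE,Year,GDP",
             "Austria,AUT,2000,1.5",
             "Belgium,BEL,2000,2.5"),
           file.path(tmp, "gdp.csv"))
writeLines(c("COUNTRYNAME,COUNTRYCODE,Year,POP",
             "Austria,AUT,2000,8",
             "Chile,CHL,2000,NULL"),
           file.path(tmp, "pop.csv"))

keys <- c("COUNTRYNAME", "COUNTRYCODE", "Year")

# Step 1: read every CSV into a list of data.frames
filenames <- list.files(path = tmp, pattern = "\\.csv$", full.names = TRUE)
import.list <- lapply(filenames, read.csv, header = TRUE, na.strings = "NULL")

# Step 2: fold the list into one data.frame with repeated full outer merges
merged <- Reduce(function(x, y) merge(x, y, by = keys, all = TRUE), import.list)
print(merged)
```

With these two toy files the result has three rows, one per country, and missing combinations are filled with NA because of all = TRUE. Note that this reads every file into memory before merging, so for very large files the one-at-a-time loop may still be the safer choice.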
Alternatively, you can do this with the reshape package if you aren't comfortable with Reduce:
library(reshape)
data <- merge_recurse(import.list)
If I'm not mistaken, a pretty simple change could eliminate the 3:length(FileNames) kludge:
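FileNames <- list.files(path=".../tempDataFolder/", full.names=TRUE)
# Start from NULL: merging into an empty data.frame() fails because it has no key columns
dataMerge <- NULL
for(f in FileNames){
  ReadInMerge <- read.csv(file=f, header=TRUE, na.strings="NULL")
  if(is.null(dataMerge)){
    dataMerge <- ReadInMerge  # the first file seeds the result
  } else {
    dataMerge <- merge(dataMerge, ReadInMerge, by=c("COUNTRYNAME", "COUNTRYCODE", "Year"), all=TRUE)
  }
}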