Hi, I have a lot of CSV files to process. Each file is generated by one run of an algorithm. My data always has one key and one value, like this:
csv1:
  index value
1     1     1
2     2     1
3     3     1
4     4     1
5     5     1
csv2:
  index value
1     4     3
2     5     3
3     6     3
4     7     3
5     8     3
Now I want to aggregate the data from these CSV files, like this:
When both files contain the same key, e.g. 5, the resulting row should contain that shared key (5) and the mean of both values ((1+3)/2 = 2). If only one file contains a key (e.g. 2), that row is just added to the result table (key = 2, value = 1).
Something like this:
  index value
1     1     1
2     2     1
3     3     1
4     4     2   (as (1+3)/2 = 2)
5     5     2   (as (1+3)/2 = 2)
6     6     3
7     7     3
8     8     3
At first I thought rbind() would do the job, but it does not aggregate the values, it only concatenates the data. How can I achieve that in R?
In R, the merge() function joins two data frames on one or more shared columns, much like a JOIN in SQL or any other DBMS; the columns being joined on must have compatible types. The dplyr package offers the same operations as inner_join(), left_join(), right_join() and full_join(). To join more than two data frames, you can fold the join over a list, e.g. with base R's Reduce() (or purrr::reduce() from the tidyverse). By contrast, rbind() simply stacks data frames whose column names match on top of each other; it combines rows but does not join or aggregate anything.
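For the two-file case described in the question, a minimal sketch with base R merge() could look like the following (df1 and df2 stand for the data read from csv1 and csv2; this is only an illustration, not the full solution below):
# toy versions of csv1 and csv2 from the question
df1 <- data.frame(index = 1:5, value = 1)
df2 <- data.frame(index = 4:8, value = 3)
# full outer join on the shared key column
both <- merge(df1, df2, by = "index", all = TRUE, suffixes = c(".1", ".2"))
# mean of the two values; na.rm = TRUE keeps keys that occur in only one file
both$value <- rowMeans(both[, c("value.1", "value.2")], na.rm = TRUE)
both[, c("index", "value")]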
Here is a solution. I am following all the excellent comments so far, and hopefully adding value by showing you how to handle any number of files. I am assuming you have all your CSV files in the same directory (my.csv.dir below).
# locate the files (full.names = TRUE so read.csv gets the full paths)
files <- list.files(my.csv.dir, full.names = TRUE)
# read the files into a list of data.frames
data.list <- lapply(files, read.csv)
# concatenate into one big data.frame
data.cat <- do.call(rbind, data.list)
# aggregate: mean of value for each index
data.agg <- aggregate(value ~ index, data.cat, mean)
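With the two sample files from the question, and assuming they read in cleanly as comma-separated files (otherwise swap read.csv for read.table), data.agg should come out as the requested table:
data.agg
#   index value
# 1     1     1
# 2     2     1
# 3     3     1
# 4     4     2
# 5     5     2
# 6     6     3
# 7     7     3
# 8     8     3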
Edit: to handle your updated question in your comment below:
files <- list.files(my.csv.dir)
# the algorithm name is assumed to be the part of the file name before the first "-"
algo.name <- sub("-.*", "", files)
# read each file and tag its rows with the algorithm it came from
data.list <- lapply(file.path(my.csv.dir, files), read.csv)
data.list <- Map(transform, data.list, algorithm = algo.name)
data.cat <- do.call(rbind, data.list)
# mean of value per algorithm and index
data.agg <- aggregate(value ~ algorithm + index, data.cat, mean)
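The sub() call assumes each file name starts with the algorithm name followed by a dash; with hypothetical names it behaves like this:
sub("-.*", "", c("algoA-run1.csv", "algoA-run2.csv", "algoB-run1.csv"))
# [1] "algoA" "algoA" "algoB"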