I have 20 different .csv files and I need to somehow stack the data in R so that I can get an overall picture of the data. At present I am copying and pasting the columns in Excel to make one big data set. However, I am sure there is a quicker and more efficient way of doing this in R, as the manual approach would take a while.
Also, to make things worse, some of the variable names are not the same in each data set, e.g. VARIABLE1 is written as variable1 in some datasets. How would I rectify this in R, as I understand that R is case sensitive?
Any help would be greatly appreciated. Thanks!
The easiest and fastest way to do this, if you are (or wish to be) familiar with the data.table package, is the following (not tested):
require(data.table)
in_pth <- "path_to_csv_files" # directory where CSV files are located, not the files.
files <- list.files(in_pth, full.names=TRUE, recursive=FALSE, pattern="\\.csv$")
out <- rbindlist(lapply(files, fread))
list.files parameters:
full.names=TRUE will return the full path to each file. Suppose in_pth <- "c:\\my_csv_folder" and inside it you have two files: 01.csv and 02.csv. Then full.names=TRUE will return c:\\my_csv_folder\\01.csv and c:\\my_csv_folder\\02.csv (the full paths).
recursive=FALSE will not search inside directories within your in_pth folder. Assume you have two more csv files in c:\\my_csv_folder\\another_folder. If you want to load those files as well, set recursive=TRUE, which scans every subdirectory below in_pth.
pattern="\\.csv$": this is a regular expression that tells list.files which files to return. If your folder has text files (.txt) in addition to csv files, specifying this pattern means you load only the csv files. If your folder has only CSV files, it is not necessary.
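To see how the three arguments interact, here is a minimal sketch that builds a throwaway folder under tempdir() and lists it in different ways. The folder and file names are made up for illustration and simply mirror the example above.

```r
# Build a throwaway folder so the example does not depend on real data.
in_pth <- file.path(tempdir(), "my_csv_folder")  # hypothetical folder name
dir.create(file.path(in_pth, "another_folder"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(in_pth, c("01.csv", "02.csv", "notes.txt")))
file.create(file.path(in_pth, "another_folder", "03.csv"))

# Only the CSV files directly inside in_pth, returned with their full paths.
top_level <- list.files(in_pth, full.names = TRUE, recursive = FALSE, pattern = "\\.csv$")

# recursive = TRUE also descends into another_folder.
all_levels <- list.files(in_pth, full.names = TRUE, recursive = TRUE, pattern = "\\.csv$")

# With the default full.names = FALSE you get bare file names instead.
bare_names <- list.files(in_pth, pattern = "\\.csv$")

print(basename(top_level))
print(basename(all_levels))
print(bare_names)
```

notes.txt is skipped in every call because of the pattern, and 03.csv only shows up when recursive=TRUE.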
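rbindlist avoids conflicts in column names by retaining the names of the first data.table in the list. That is, if you have two data.tables dt1 and dt2 with column names x,y and a,b respectively, then rbindlist(list(dt1, dt2)) will take care of changing a,b to x,y and rbindlist(list(dt2, dt1)) will take care of changing x,y to a,b. Note that rbindlist takes a single list of tables, and that this renaming works because columns are bound by position, so it relies on every file having its columns in the same order.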
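A small sketch of both situations, using made-up tables. The first part shows positional binding, with use.names=FALSE spelled out so the behaviour does not depend on the installed data.table version's default. The second part is an alternative that is not part of the answer above: if the files differ only in the case of their column names but may also differ in column order, you can normalise the names with setnames() and bind by name instead.

```r
library(data.table)

dt1 <- data.table(x = 1:2, y = c("a", "b"))
dt2 <- data.table(a = 3:4, b = c("c", "d"))

# Bind by position: the result keeps the names of the first table in the list.
by_pos_12 <- rbindlist(list(dt1, dt2), use.names = FALSE)
by_pos_21 <- rbindlist(list(dt2, dt1), use.names = FALSE)
print(names(by_pos_12))
print(names(by_pos_21))

# Hypothetical tables whose names differ only in case and whose
# columns are stored in a different order.
upper <- data.table(VARIABLE1 = 1:2, VARIABLE2 = 3:4)
lower <- data.table(variable2 = 7:8, variable1 = 5:6)

# Upper-case the names on a copy of each table, then match columns by name.
tables <- lapply(list(upper, lower), function(dt) {
  setnames(copy(dt), toupper(names(dt)))
})
by_name <- rbindlist(tables, use.names = TRUE)
print(by_name)
```

Binding by name is the safer choice whenever you cannot guarantee the column order, since positional binding would silently stack variable2 under VARIABLE1 here.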
fread takes care of column types, headers, separators etc. automatically in most cases, and is extremely fast. It was still marked experimental when this was written, so you may want to check your output to be sure it is all fine.
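As a rough check of what fread detects on its own, the sketch below writes two tiny temporary files, one comma-separated and one semicolon-separated, and reads both without specifying sep or header. The file contents are invented for illustration.

```r
library(data.table)

# Two small files with the same columns but different separators.
comma_file <- tempfile(fileext = ".csv")
semi_file  <- tempfile(fileext = ".csv")
writeLines(c("id,score", "1,2.5", "2,3.5"), comma_file)
writeLines(c("id;score", "3;4.5", "4;5.5"), semi_file)

# No sep= or header= given: fread works them out from the file contents.
a <- fread(comma_file)
b <- fread(semi_file)

# Inspect the detected names and column types before stacking.
str(a)
str(b)

out <- rbindlist(list(a, b))
print(out)
```

Looking at str() for one or two files before binding all of them is a cheap way to catch a column that was read with an unexpected type.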