
How do I stack data in R?

Tags: merge, r, dataset

I have 20 different .csv files and I need to somehow stack the data in R so that I can get an overall picture of it. At present I am copying and pasting the columns in Excel to make one big data set. That will take a while, and I am sure there is a quicker and more efficient way of doing this in R.

To make things worse, some of the variable names are not the same in each data set, e.g. VARIABLE1 is written as variable1 in some of them. How would I rectify this in R, given that R is case sensitive?

Any help would be greatly appreciated. Thanks!

REnthusiast asked Nov 03 '22 19:11



1 Answer

If you are familiar with the data.table package (or are willing to become so), the easiest and fastest way to do this is the following (not tested):

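library(data.table)
in_pth <- "path_to_csv_files" # directory where the CSV files are located, not the files themselves
# full paths of every file in in_pth whose name ends in .csv
files <- list.files(in_pth, full.names=TRUE, recursive=FALSE, pattern="\\.csv$")
# read each file with fread, then stack the results into one data.table
out <- rbindlist(lapply(files, fread))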

list.files parameters:

  • full.names = TRUE returns each file name with the directory path prepended. Suppose in_pth <- "c:\\my_csv_folder" and the folder holds two files, 01.csv and 02.csv. Then full.names=TRUE returns the full path to each of them inside c:\\my_csv_folder, whereas full.names=FALSE would return only 01.csv and 02.csv. The full path is what fread needs unless in_pth is your working directory.

  • recursive = FALSE does not search inside directories within your in_pth folder. Suppose you have two more csv files in c:\\my_csv_folder\\another_folder. If you want to load those as well, set recursive=TRUE, which descends into every sub-directory below in_pth.

  • pattern="\\.csv$": This is a regular expression that selects which files to list, here those whose names end in .csv. If your folder also holds text files (.txt) in addition to csv files, specifying this pattern makes sure you load only the csv files. If your folder has only CSV files, it is not necessary.
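To see how the three arguments interact, here is a minimal sketch in base R. It builds a throwaway folder under tempdir(); the folder and file names are made up for illustration and mirror the example names used above.

```r
# Build a small throwaway directory tree (hypothetical names).
in_pth <- file.path(tempdir(), "my_csv_folder")
dir.create(file.path(in_pth, "another_folder"),
           recursive = TRUE, showWarnings = FALSE)
write.csv(data.frame(x = 1), file.path(in_pth, "01.csv"), row.names = FALSE)
write.csv(data.frame(x = 2), file.path(in_pth, "02.csv"), row.names = FALSE)
writeLines("not a csv", file.path(in_pth, "notes.txt"))
write.csv(data.frame(x = 3),
          file.path(in_pth, "another_folder", "03.csv"), row.names = FALSE)

# File names only, top level only: the .txt file and the sub-folder are skipped.
top_names <- list.files(in_pth, pattern = "\\.csv$")

# Same files, but with the directory path prepended.
top_paths <- list.files(in_pth, full.names = TRUE, recursive = FALSE,
                        pattern = "\\.csv$")

# Also descend into another_folder.
all_paths <- list.files(in_pth, full.names = TRUE, recursive = TRUE,
                        pattern = "\\.csv$")

print(top_names)
print(top_paths)
print(all_paths)
```

Printing the three vectors side by side is a quick way to confirm which files will be read before handing the paths to fread.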


data.table functions:

  • rbindlist takes a list of data.tables and avoids conflicts in column names by binding columns by position and keeping the names of the first data.table in the list. That is, if you have two data.tables dt1 and dt2 with column names x,y and a,b respectively, then rbindlist(list(dt1, dt2)) returns columns named x,y and rbindlist(list(dt2, dt1)) returns columns named a,b.

  • fread takes care of column types, headers, separators and so on automatically in most cases, and is extremely fast. It was still experimental at the time of writing, so you may want to check your output to be sure it is all fine.
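For the VARIABLE1 versus variable1 part of the question, one option is to lower-case the column names of every table before stacking. The sketch below uses two toy tables in place of the tables fread would return; the data are made up, and it assumes a data.table version whose rbindlist accepts the use.names and fill arguments.

```r
library(data.table)

# Two toy tables whose column names differ only in case (hypothetical data).
dt1 <- data.table(VARIABLE1 = 1:2, VARIABLE2 = c("a", "b"))
dt2 <- data.table(variable1 = 3:4, variable2 = c("c", "d"))

# Binding by position: the result keeps the names of the first table.
by_pos <- rbindlist(list(dt1, dt2), use.names = FALSE)

# Safer when column order may differ between files: normalise the names
# first, then bind by name. copy() leaves dt1 and dt2 themselves untouched.
tables <- lapply(list(dt1, dt2), function(dt) {
  setnames(copy(dt), tolower(names(dt)))
})

# fill = TRUE pads a column that is missing from one table with NA.
out <- rbindlist(tables, use.names = TRUE, fill = TRUE)

print(names(by_pos))
print(out)
```

With real files, the list passed to lapply would be lapply(files, fread) instead of list(dt1, dt2).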

Arun answered Nov 09 '22 08:11
