I have a folder containing a bunch of CSV files that are titled "yob1980", "yob1981", "yob1982" etc.
I have to use a for loop to go through each file and put its contents into a data frame - the columns in the data frame should be "1980", "1981", "1982" etc
Here is what I have:
file_list <- list.files()
temp = list.files(pattern="*.txt")
babynames <- do.call(rbind,lapply(temp,read.csv, FALSE))
names(babynames) <- c("Name", "Gender", "Count")
I feel like I need a for loop, but I'm not sure how to loop through the files. Can anyone point me in the right direction?
My favourite way to do this is using ldply from the plyr package. It has the advantage of returning a data frame, so you don't need to do the rbind step afterwards:
library( plyr )
babynames <- ldply( .data = list.files(pattern="*.txt"),
.fun = read.csv,
header = FALSE,
col.names=c("Name", "Gender", "Count") )
As an added benefit, you can multi-thread the import very easily, making importing large multi-file datasets quite a bit faster:
library( plyr )
library( doMC )
registerDoMC( cores = 4 )
babynames <- ldply( .data = list.files(pattern="*.txt"),
.fun = read.csv,
header = FALSE,
col.names=c("Name", "Gender", "Count"),
.parallel = TRUE )
Changing the above slightly to include a Year column in the resulting data frame, you can create a function first, then execute that function within ldply in the same way you would execute read.csv:
readFun <- function( filename ) {
# read in the data
data <- read.csv( filename,
header = FALSE,
col.names = c( "Name", "Gender", "Count" ) )
# add a "Year" column by removing both "yob" and ".txt" from file name
data$Year <- gsub( "yob|\\.txt", "", filename )
return( data )
}
# execute that function across all files, outputting a data frame
doMC::registerDoMC( cores = 4 )
babynames <- plyr::ldply( .data = list.files(pattern="*.txt"),
.fun = readFun,
.parallel = TRUE )
This will give you your data in a concise and tidy (long) format, which is how I'd recommend moving forward from here. While it is possible to then separate each year's data into its own column, it's likely not the best way to go.
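If you do need the one-column-per-year layout from the question, one possible route is base R's reshape(). This is only a minimal sketch: the data frame named long and its values are made up here to stand in for the real babynames result.

```r
# Made-up stand-in for the combined long-format data (not real counts)
long <- data.frame(
  Name   = c("Mary", "John", "Mary", "John"),
  Gender = c("F", "M", "F", "M"),
  Count  = c(10L, 12L, 11L, 9L),
  Year   = c("1980", "1980", "1981", "1981"),
  stringsAsFactors = FALSE
)

# Spread Count into one column per Year, one row per Name/Gender pair
wide <- reshape(long,
                idvar     = c("Name", "Gender"),
                timevar   = "Year",
                direction = "wide")

# reshape() names the new columns "Count.1980", "Count.1981", ...;
# strip the prefix so the columns are just the years
names(wide) <- sub("^Count\\.", "", names(wide))

print(wide)
```

Names that are missing from a given year come out as NA in that year's column, which is one reason the long format is usually easier to work with.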
Note: depending on your preference, it may be a good idea to convert the Year column to, say, integer class. But that's up to you.
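For what it's worth, the conversion is a one-liner. A small sketch, with a toy data frame (made-up values) standing in for the real result:

```r
# Toy stand-in for the combined data frame (values are made up)
babynames <- data.frame(
  Name   = c("Mary", "John"),
  Gender = c("F", "M"),
  Count  = c(10L, 12L),
  Year   = c("1980", "1981"),
  stringsAsFactors = FALSE
)

# gsub() returns character strings, so Year starts out as character;
# convert it before sorting numerically or doing arithmetic on it
babynames$Year <- as.integer(babynames$Year)

str(babynames$Year)
```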
Using purrr
library(tidyverse)
files <- list.files(path = "./data/", pattern = "*.csv")
df <- files %>%
map(function(x) {
read.csv(paste0("./data/", x))
}) %>%
reduce(rbind)
Consider an anonymous function within an lapply():
files = list.files(pattern="*.txt")
dfList <- lapply(files, function(i) {
df <- read.csv(i, header=FALSE, col.names=c("Name", "Gender", "Count"))
df$Year <- gsub("yob|\\.txt", "", i)
return(df)
})
finaldf <- do.call(rbind, dfList)
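Since the question specifically asks about a for loop, here is roughly the same logic written that way. Treat it as a sketch under assumptions: the temporary directory, the two tiny files and their contents are invented so the example runs on its own, and colClasses is added only to keep Gender from being guessed as logical.

```r
# Create two tiny example files in a temporary directory (made-up data)
demo_dir <- tempfile("babynames")
dir.create(demo_dir)
writeLines(c("Mary,F,10", "John,M,12"), file.path(demo_dir, "yob1980.txt"))
writeLines(c("Mary,F,11", "John,M,9"),  file.path(demo_dir, "yob1981.txt"))

# list.files() takes a regular expression, not a shell glob
files <- list.files(demo_dir, pattern = "^yob[0-9]{4}\\.txt$", full.names = TRUE)

# Pre-allocate a list and fill one slot per file
dfList <- vector("list", length(files))
for (k in seq_along(files)) {
  df <- read.csv(files[k], header = FALSE,
                 col.names  = c("Name", "Gender", "Count"),
                 colClasses = c("character", "character", "integer"))
  # Take the year from the file name, e.g. "yob1980.txt" -> 1980
  df$Year <- as.integer(gsub("yob|\\.txt", "", basename(files[k])))
  dfList[[k]] <- df
}

# Bind once at the end instead of growing a data frame inside the loop
finaldf <- do.call(rbind, dfList)
print(finaldf)
```

Collecting the pieces in a list and calling rbind once tends to be faster than calling rbind on every iteration, which copies the growing data frame each time.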