
Add "filename" column to table as multiple files are read and bound

Tags:

r

lapply

I have numerous csv files in multiple directories that I want to read into an R tibble or data.table. I use "list.files()" with the recursive argument set to TRUE to create a list of file names and paths, then use "lapply()" to read in the csv files, and then "bind_rows()" to stick them all together:

filenames <- list.files(path, full.names = TRUE, pattern = fileptrn, recursive = TRUE)
tbl <- lapply(filenames, read_csv) %>% 
  bind_rows()

This approach works fine. However, I need to extract a substring from each file name and add it as a column to the final table. I can get the substring I need with "str_extract()" like this:

sites <- str_extract(filenames, "[A-Z]{2}-[A-Za-z0-9]{3}")

I am stuck, however, on how to add the extracted substring as a column while lapply() runs read_csv() on each file.

kray asked Sep 19 '17 11:09


2 Answers

tidyverse approach:

Update:

readr 2.0 (and later) has built-in support for reading a list of files with the same columns into one output table in a single command. Pass the file names as a single vector to the reading function; the id argument names a column that records the source path of each row. For example, reading in csv files:

(files <- fs::dir_ls("D:/data", glob="*.csv"))
dat <- read_csv(files, id="path")
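Once the path is available as a column, the site code asked about in the question can be pulled out of it afterwards. A minimal sketch, assuming readr 2.0 or later; the temporary directory, the file names, and the "value" column are made up for illustration, and only the regular expression comes from the question:

```r
library(readr)
library(dplyr)
library(stringr)

# Hypothetical demo data: two small csv files whose names carry a site code
demo_dir <- file.path(tempdir(), "site_demo")
dir.create(demo_dir, showWarnings = FALSE)
write_csv(tibble(value = 1:2), file.path(demo_dir, "US-Ab1_2017.csv"))
write_csv(tibble(value = 3:4), file.path(demo_dir, "CA-xY2_2017.csv"))

files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# id = "path" stores each row's source file; the site code is then
# extracted from the file name with the regex from the question
dat <- read_csv(files, id = "path", show_col_types = FALSE) %>%
  mutate(site = str_extract(basename(path), "[A-Z]{2}-[A-Za-z0-9]{3}"))
```

Applying str_extract() to basename(path) rather than the full path avoids accidental matches in directory names.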

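Alternatively, use map_dfr() from purrr: add the file name with the .id = "source" argument of purrr::map_dfr(). This works because fs::dir_ls() returns a vector named by path, and .id takes its values from those names. An example loading .csv files: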

library(tidyverse)
library(here)

# specify the directory, then list the csv files in it
data_dir <- here("file/path")
data_list <- fs::dir_ls(data_dir, regexp = "\\.csv$")

# return a single data frame w/ purrr::map_dfr
my_data <- data_list %>%
  purrr::map_dfr(read_csv, .id = "source")

# Alternatively, reduce source from the full file path to the file name
my_data <- data_list %>%
  purrr::map_dfr(read_csv, .id = "source") %>%
  dplyr::mutate(source = basename(source))
  
derelict answered Nov 18 '22 05:11


I generally use the following approach, based on dplyr/tidyr:

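library(tidyverse)

# `files` is the character vector of paths (`filenames` in the question)
data = tibble(File = files) %>%
    # capture the site code from each path into a new Site column
    extract(File, "Site", "([A-Z]{2}-[A-Za-z0-9]{3})", remove = FALSE) %>%
    # read each file into a list-column, then expand it into rows
    mutate(Data = lapply(File, read_csv)) %>%
    unnest(Data) %>%
    select(-File)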
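For comparison, the same idea can be kept closer to the lapply() call in the question by iterating over the file names and the extracted site codes in parallel. This is only a sketch under stated assumptions: the demo directory, file names, and "value" column are hypothetical, and base R's Map() stands in for lapply() because two vectors are walked together.

```r
library(readr)
library(dplyr)
library(stringr)

# Hypothetical demo files standing in for the real directory tree
demo_dir <- file.path(tempdir(), "lapply_demo")
dir.create(demo_dir, showWarnings = FALSE)
write_csv(tibble(value = 1:2), file.path(demo_dir, "US-Ab1.csv"))
write_csv(tibble(value = 3:4), file.path(demo_dir, "CA-xY2.csv"))

filenames <- list.files(demo_dir, pattern = "\\.csv$",
                        full.names = TRUE, recursive = TRUE)
sites <- str_extract(basename(filenames), "[A-Z]{2}-[A-Za-z0-9]{3}")

# Map() pairs each file with its own site code, so the column is added
# at read time, before the tables are bound together
tbl <- Map(function(f, s) mutate(read_csv(f, show_col_types = FALSE), Site = s),
           filenames, sites) %>%
  bind_rows()
```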
Konrad Rudolph answered Nov 18 '22 03:11