
Compute number of files per folder in a complex folder structure?

I have created a simple data.tree by importing a folder structure with files inside it.

if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/pathr")

library(pathr)
library(data.tree)

folder_structure <- pathr::tree(path = "/Users/username/Downloads/top_level/",
 use.data.tree = TRUE, include.files = TRUE)

Now, I would like to convert the object folder_structure into a data.frame with one row per folder and a column that specifies how many files each folder contains. How can I accomplish this?

For example, I have this very simple folder structure:

top_level_folder
    sub_folder_1
        file1.txt
    sub_folder_2
        file2.txt

Answering the question would involve creating an output that looks like this:

Folders             Files
top_level_folder    0
sub_folder_1        1
sub_folder_2        1

The first column can simply be generated by calling list.dirs("/Users/username/Downloads/top_level/"), but I don't know how to generate the second column. Note that the second column is non-recursive, meaning that files within subfolders are not counted (i.e. top_level_folder contains 0 files, even though the subfolders of top_level_folder contain 2 files).

If you want to see whether your solution scales or not, download the Rails codebase: https://github.com/rails/rails/archive/master.zip and try it on Rails' more complex file structure.

histelheim asked Mar 09 '23 21:03



1 Answer

list.dirs() returns a vector of every subdirectory reachable from a starting folder, so that handles the first column of your data frame. Very convenient.

# Get a vector of all the directories and subdirectories from this folder
dir <- "."
xs <- list.dirs(dir, recursive = TRUE)

list.files() can tell us the contents of each of those folders, but it includes files and folders. We just want the files. To get the count of files, we need to filter the output of list.files() with a predicate. file.info() can tell us whether a given file is a directory or not, so we build our predicate from that.

# Helper to check if something is folder or file
is_dir <- function(x) file.info(x)[["isdir"]]
is_file <- Negate(is_dir)
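
As an alternative sketch of the same predicate, base R's dir.exists() (available from R 3.2.0) is vectorized and can stand in for file.info(). One difference to be aware of: a path that does not exist yields TRUE here, whereas the file.info() version yields NA. The scratch directory and the names is_file_alt, a.txt and sub are made up for illustration.

```r
# Hypothetical scratch directory containing one file and one subfolder
root <- file.path(tempdir(), "is_file_demo")
dir.create(file.path(root, "sub"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(root, "a.txt"))

# Alternative predicate: anything that is not a directory counts as a file
is_file_alt <- function(x) !dir.exists(x)

# Apply the predicate to the folder's contents and label the results
paths <- list.files(root, full.names = TRUE)
result <- is_file_alt(paths)
names(result) <- basename(paths)
result
```

Because the paths come straight from list.files(), the nonexistent-path case should not normally arise in this use.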

Next, we work out how to count the files in a single folder. Summing logical values returns the number of TRUE cases.

# Count the files in a single folder
count_files_in_one_dir <- function(dir) {
  files <- list.files(dir, full.names = TRUE)
  sum(is_file(files))
}
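
One caveat worth checking: list.files() skips hidden files (names starting with a dot) by default, so dotfiles are not counted by the function above. Here is a minimal sketch of a variant that does count them, assuming that is what you want. The name count_all_files_in_one_dir and the scratch directory are hypothetical.

```r
# Variant that also counts hidden files (dotfiles)
count_all_files_in_one_dir <- function(dir) {
  # all.files = TRUE includes dotfiles; no.. = TRUE drops "." and ".."
  entries <- list.files(dir, full.names = TRUE, all.files = TRUE, no.. = TRUE)
  # Everything that is not a directory counts as a file
  sum(!dir.exists(entries))
}

# Hypothetical scratch directory with one visible and one hidden file
root <- file.path(tempdir(), "hidden_demo")
dir.create(root, showWarnings = FALSE)
file.create(file.path(root, c("visible.txt", ".hidden")))

n_default <- length(list.files(root))         # visible entries only
n_all <- count_all_files_in_one_dir(root)     # visible and hidden files
```

Whether hidden files should count depends on the use case, so the default behaviour in the answer may already be what you want.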

For convenience, we wrap that function to make it work on many folders.

# Vectorized version of the above
count_files_in_dir <- function(dir) {
  vapply(dir, count_files_in_one_dir, numeric(1), USE.NAMES = FALSE)
}

Now we can count the files.

df <- tibble::tibble(
  dir = xs,
  nfiles = count_files_in_dir(xs))

df
#> # A tibble: 688 x 2
#>                                                  dir nfiles
#>                                                <chr>  <dbl>
#>  1                                                 .     11
#>  2                                         ./.github      3
#>  3                                     ./actioncable      7
#>  4                                 ./actioncable/app      0
#>  5                          ./actioncable/app/assets      0
#>  6              ./actioncable/app/assets/javascripts      1
#>  7 ./actioncable/app/assets/javascripts/action_cable      5
#>  8                                 ./actioncable/bin      1
#>  9                                 ./actioncable/lib      1
#> 10                    ./actioncable/lib/action_cable      8
#> # ... with 678 more rows
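
To get the two-column layout from the question, the same approach can be sketched on a throwaway copy of the example tree using only base R. The temporary location and the helper name count_files are assumptions for illustration; the folder and file names follow the question.

```r
# Rebuild the question's example tree in a scratch location
top <- file.path(tempdir(), "top_level_folder")
unlink(top, recursive = TRUE)
dir.create(file.path(top, "sub_folder_1"), recursive = TRUE)
dir.create(file.path(top, "sub_folder_2"))
file.create(file.path(top, "sub_folder_1", "file1.txt"))
file.create(file.path(top, "sub_folder_2", "file2.txt"))

# Non-recursive file count for one folder
count_files <- function(d) {
  entries <- list.files(d, full.names = TRUE)
  sum(!file.info(entries)[["isdir"]])
}

# One row per folder, with the folder's own name and its file count
folders <- list.dirs(top, recursive = TRUE)
result <- data.frame(
  Folders = basename(folders),
  Files = vapply(folders, count_files, numeric(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
result
```

basename() is used only to match the question's layout. In a larger tree such as Rails, folder names repeat, so keeping the full path in the first column is likely safer.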
TJ Mahr answered Mar 11 '23 13:03