Split large directory into subdirectories

I have a directory with about 2.5 million files in it, totaling over 70 GB.

I want to split this into subdirectories, each with 1000 files in them.

Here's the command I've tried using:

i=0; for f in *; do d=dir_$(printf %03d $((i/1000+1))); mkdir -p $d; mv "$f" $d; let i++; done

That command works for me on a small scale, but I can leave it running for hours on this directory and it doesn't seem to do anything.

I'm open to doing this any way via the command line: Perl, Python, etc. Just whatever way would be the fastest to get this done...

asked Dec 11 '22 at 17:12 by Edward

2 Answers

I suspect that if you checked, you'd notice your program was actually moving the files, albeit really slowly. Launching a program is rather expensive (at least compared to making a system call), and you do so three or four times per file! As such, the following should be much faster:

perl -e'
   my $base_dir_qfn = ".";
   my $i = 0;
   my $dir_qfn;
   opendir(my $dh, $base_dir_qfn)
      or die("Can'\''t open dir \"$base_dir_qfn\": $!\n");

   while (defined( my $fn = readdir($dh) )) {
      next if $fn =~ /^(?:\.\.?|dir_\d+)\z/;

      my $qfn = "$base_dir_qfn/$fn";

      if ($i % 1000 == 0) {
         $dir_qfn = sprintf("%s/dir_%03d", $base_dir_qfn, int($i/1000)+1);
         mkdir($dir_qfn)
            or die("Can'\''t make directory \"$dir_qfn\": $!\n");
      }

      rename($qfn, "$dir_qfn/$fn")
         or do {
            warn("Can'\''t move \"$qfn\" into \"$dir_qfn\": $!\n");
            next;
         };

      ++$i;
   }
'
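
To sanity-check the result afterwards, here's a minimal sketch (assuming the dir_NNN naming used above; run it from the parent directory) that counts the files placed in each new subdirectory:

for d in dir_*/; do
   # count the regular files directly inside each dir_* subdirectory
   printf '%s\t%s\n' "$d" "$(find "$d" -maxdepth 1 -type f | wc -l)"
done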
answered Dec 13 '22 at 07:12 by ikegami


Note: ikegami's helpful Perl-based answer is the way to go - it performs the entire operation in a single process and is therefore much faster than the Bash + standard utilities solution below.


To perform reasonably, a bash-based solution needs to avoid loops in which external utilities are called.
Your own solution calls two external utilities and creates a subshell in each loop iteration, which means that you'll end up creating about 7.5 million processes(!) in total.

The following solution avoids loops, but, given the sheer number of input files, will still take quite a while to complete (you'll end up creating 4 processes for every 1000 input files, i.e., ca. 10,000 processes in total):

printf '%s\0' * | xargs -0 -n 1000 bash -O nullglob -c '
  dirs=( dir_*/ )
  dir=dir_$(printf %04d $(( 1 + ${#dirs[@]} )))
  mkdir "$dir"; mv "$@" "$dir"' -

  • printf '%s\0' * prints a NUL-separated list of all files in the directory.
    • Note that since printf is a Bash builtin rather than an external utility, the max. command-line length reported by getconf ARG_MAX does not apply.
  • xargs -0 -n 1000 invokes the specified command with chunks of 1000 input filenames.
    • Note that xargs -0 is nonstandard, but supported on both Linux and BSD/OSX.
    • Using NUL-separated input robustly passes the filenames without fear of inadvertently splitting them into multiple parts, and it even works with filenames that contain embedded newlines (though such filenames are very rare).
  • bash -O nullglob -c executes the specified command string with option nullglob turned on, which means that a globbing pattern that matches nothing expands to nothing rather than to itself (see the short demo after this list).
    • The command string counts the output directories created so far to determine the next output directory's name (the next higher index), creates that directory, and moves the current batch of (up to) 1000 files into it.
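
To illustrate why nullglob matters for the directory count, here is a minimal sketch (assuming you start in a directory that has no dir_* subdirectories yet):

shopt -s nullglob          # unmatched globs expand to nothing
dirs=( dir_*/ )
echo "${#dirs[@]}"         # -> 0, so the next index is 1 + 0 = 1

shopt -u nullglob          # default behavior: unmatched globs are left as literal text
dirs=( dir_*/ )
echo "${#dirs[@]}"         # -> 1 (the literal string 'dir_*/'), which would throw off the numbering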
answered Dec 13 '22 at 06:12 by mklement0