I have a directory with about 2.5 million files that is over 70 GB.
I want to split this into subdirectories, each with 1000 files in them.
Here's the command I've tried using:
i=0; for f in *; do d=dir_$(printf %03d $((i/1000+1))); mkdir -p $d; mv "$f" $d; let i++; done
That command works for me on a small scale, but I can leave it running for hours on this directory and it doesn't seem to do anything.
I'm open to doing this in any way via the command line: perl, python, etc. Just whatever way would be the fastest to get this done...
I suspect that if you checked, you'd notice your program was actually moving the files, albeit really slowly. Launching a program is rather expensive (at least compared to making a system call), and you do so three or four times per file! As such, the following should be much faster:
perl -e'
   my $base_dir_qfn = ".";
   my $i = 0;
   my $dir_qfn;

   opendir(my $dh, $base_dir_qfn)
      or die("Can'\''t open dir \"$base_dir_qfn\": $!\n");

   while (defined( my $fn = readdir($dh) )) {
      next if $fn =~ /^(?:\.\.?|dir_\d+)\z/;

      my $qfn = "$base_dir_qfn/$fn";

      if ($i % 1000 == 0) {
         $dir_qfn = sprintf("%s/dir_%03d", $base_dir_qfn, int($i/1000)+1);
         mkdir($dir_qfn)
            or die("Can'\''t make directory \"$dir_qfn\": $!\n");
      }

      rename($qfn, "$dir_qfn/$fn")
         or do {
            warn("Can'\''t move \"$qfn\" into \"$dir_qfn\": $!\n");
            next;
         };

      ++$i;
   }
'
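Once the script finishes, a quick sanity check can confirm that the files were distributed as expected. This is just an illustrative sketch, not part of the answer above; it assumes the dir_NNN naming used by the script and that find and wc are available:
# count the regular files directly inside each dir_* subdirectory
for d in dir_*/; do
    printf '%s\t%s\n' "$d" "$(find "$d" -maxdepth 1 -type f | wc -l)"
done
Every subdirectory except possibly the last should report 1000.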
Note: ikegami's helpful Perl-based answer is the way to go - it performs the entire operation in a single process and is therefore much faster than the Bash + standard utilities solution below.
A bash-based solution needs to avoid loops in which external utilities are called in order to perform reasonably.
Your own solution calls two external utilities and creates a subshell in each loop iteration, which means that you'll end up creating about 7.5 million processes(!) in total.
The following solution avoids loops, but, given the sheer number of input files, will still take quite a while to complete (you'll end up creating 4 processes for every 1000 input files, i.e., ca. 10,000 processes in total):
printf '%s\0' * | xargs -0 -n 1000 bash -O nullglob -c '
  dirs=( dir_*/ )
  dir=dir_$(printf %04d $(( 1 + ${#dirs[@]} )))
  mkdir "$dir"; mv "$@" "$dir"' -
printf '%s\0' * prints a NUL-separated list of all files in the dir. Note that since printf is a Bash builtin rather than an external utility, the max. command-line length as reported by getconf ARG_MAX does not apply.
xargs -0 -n 1000 invokes the specified command with chunks of 1000 input filenames. Note that xargs -0 is nonstandard, but supported on both Linux and BSD/OSX.
bash -O nullglob -c executes the specified command string with option nullglob turned on, which means that a globbing pattern that matches nothing will expand to the empty string.
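To see why nullglob matters for the first chunk, here is a minimal, hypothetical demonstration (assuming it is run in a directory that does not yet contain any dir_* subdirectories):
# with nullglob, an unmatched pattern disappears, so the array is empty
bash -O nullglob -c 'dirs=( dir_*/ ); echo ${#dirs[@]}'   # prints 0
# without nullglob, the literal pattern dir_*/ is kept as a single element
bash -c 'dirs=( dir_*/ ); echo ${#dirs[@]}'               # prints 1
With nullglob on, ${#dirs[@]} starts at 0, so the very first chunk creates dir_0001; without it, the unexpanded pattern would be miscounted as an existing directory.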