One-liner to split very large directory into smaller directories on Unix

How do you split a very large directory, containing potentially millions of files, into smaller directories with some custom-defined maximum number of files, such as 100 per directory, on UNIX?

Bonus points if you know of a way to have wget download files into these subdirectories automatically. So if there are 1 million .html pages at the top-level path at www.example.com, such as

/1.html
/2.html
...
/1000000.html

and we only want 100 files per directory, it would download them into folders something like

./www.example.com/1-100/1.html
...
./www.example.com/999901-1000000/1000000.html

I only really need to be able to run the UNIX command on the folder after wget has downloaded the files, but if it's possible to do this with wget as it's downloading, I'd love to know!

asked Jun 23 '12 by Lance

2 Answers

Another option:

i=1;while read l;do mkdir $i;mv $l $((i++));done< <(ls|xargs -n100)

Or using parallel:

ls|parallel -n100 mkdir {#}\;mv {} {#}

-n100 takes 100 arguments at a time and {#} is the sequence number of the job.
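
For readability, here is the first one-liner spelled out as a short script. This is only a sketch of the same logic; it assumes the filenames contain no whitespace, since moving each 100-name batch relies on word splitting.

#!/bin/bash
i=1
while read -r batch              # each line from xargs holds up to 100 names
do
    mkdir "$i"                   # numbered target directory: 1, 2, 3, ...
    mv $batch "$i"               # intentionally unquoted: move the whole batch
    i=$((i+1))
done < <(ls | xargs -n100)       # group the directory listing into lines of 100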

answered Sep 23 '22 by nisetama

You can run this through a couple of loops, which should do the trick (at least for the numeric part of the file name). I think that doing this as a one-liner is over-optimistic.

#!/bin/bash
for hundreds in {0..99}
do
    min=$(($hundreds*100+1))
    max=$(($hundreds*100+100))
    current_dir="$min-$max"          # e.g. 1-100, 101-200, ...
    mkdir "$current_dir"
    for ones_tens in {1..100}
    do
        current_file="$(($hundreds*100+$ones_tens)).html"
        #touch "$current_file"
        mv "$current_file" "$current_dir"
    done
done

I did performance testing by first commenting out the mkdir and mv lines and uncommenting the touch line. This created 10000 files (one hundredth of your target of 1000000 files). Once the files were created, I reverted to the script as written:

$ time bash /tmp/test.bash 2>&1 

real        0m27.700s
user        0m26.426s
sys         0m17.653s

As long as you aren't moving files across file systems, the time for each mv command should be constant, so you should see similar or better performance. Scaling this up to a million files would take around 2770 seconds, i.e. roughly 46 minutes. There are several avenues for optimization, such as moving all the files for a given directory in one mv command, or removing the inner for loop.
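
Here is a sketch of that first optimization (one mv per directory instead of one per file). It assumes the same 1.html .. 1000000.html naming from the question and GNU seq, and it is untested at full scale:

#!/bin/bash
for hundreds in {0..9999}
do
    min=$((hundreds*100+1))
    max=$((hundreds*100+100))
    current_dir="$min-$max"
    mkdir "$current_dir"
    # seq -f expands to "1.html 2.html ... 100.html" and so on;
    # a single mv call then moves the whole batch of 100 files
    mv $(seq -f '%g.html' "$min" "$max") "$current_dir"
done

With only 100 short filenames per call, the command line stays well under the system's argument-length limit, and each directory costs one mv invocation instead of one hundred.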

Doing the 'wget' to grab a million files is going to take far longer than this, and it is almost certainly going to require some optimization; saving the bandwidth spent on HTTP headers alone would cut hours off the run time. I don't think a shell script is the right tool for that job; a library such as WWW::Curl on CPAN will be much easier to optimize.
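If you do want to stay with plain wget for the bonus part of the question, one possibility is to fetch each batch of 100 URLs straight into its numbered directory. This is only a rough sketch, assuming GNU seq plus wget's -P (--directory-prefix) and -i - (read URLs from stdin) options, and it still pays the full per-request HTTP overhead:

#!/bin/bash
for hundreds in {0..9999}
do
    min=$((hundreds*100+1))
    max=$((hundreds*100+100))
    dir="www.example.com/$min-$max"
    mkdir -p "$dir"
    # build the 100 URLs for this batch and hand them to a single wget call
    seq -f 'http://www.example.com/%g.html' "$min" "$max" | wget -q -P "$dir" -i -
done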

answered Sep 22 '22 by Barton Chittenden