How to split a huge folder?

We have a folder on Windows that's ... huge. I ran "dir > list.txt". The command stopped responding after 1.5 hours. The output file is about 200 MB and shows there are at least 2.8 million files. I know the situation is stupid, but let's focus on the problem itself. If I have such a folder, how can I split it into some "manageable" sub-folders? Surprisingly, all the solutions I have come up with involve getting all the files in the folder at some point, which is a no-no in my case. Any suggestions?

Thanks to Keith Hill and Mehrdad. I accepted Keith's answer because that's exactly what I wanted to do, but I couldn't quite get PS working quickly.

With Mehrdad's tip, I wrote this little program. It took 7+ hours to move the 2.8 million files. So the initial dir command did finish; it just somehow never returned to the console.

using System;
using System.IO;

namespace SplitHugeFolder
{
    class Program
    {
        // args: [0] source folder, [1] destination root, [2] files per subfolder
        static void Main(string[] args)
        {
            var destination = args[1];

            if (!Directory.Exists(destination))
                Directory.CreateDirectory(destination);

            var di = new DirectoryInfo(args[0]);

            var batchCount = int.Parse(args[2]);
            int currentBatch = 0;

            string targetFolder = GetNewSubfolder(destination);

            // EnumerateFiles streams directory entries one at a time instead of
            // materializing all 2.8 million FileInfos up front.
            foreach (var fileInfo in di.EnumerateFiles())
            {
                // Start a fresh subfolder once the current one holds batchCount files.
                if (currentBatch == batchCount)
                {
                    Console.WriteLine("New Batch...");
                    currentBatch = 0;
                    targetFolder = GetNewSubfolder(destination);
                }

                var source = fileInfo.FullName;
                var target = Path.Combine(targetFolder, fileInfo.Name);
                File.Move(source, target);
                currentBatch++;
            }
        }

        // Creates a randomly named subfolder under the destination root and returns its path.
        private static string GetNewSubfolder(string parent)
        {
            string newFolder;
            do
            {
                newFolder = Path.Combine(parent, Path.GetRandomFileName());
            } while (Directory.Exists(newFolder));
            Directory.CreateDirectory(newFolder);
            return newFolder;
        }
    }
}
asked Jan 22 '11 by treehouse



2 Answers

I use Get-ChildItem to index my whole C: drive every night into c:\filelist.txt. That's about 580,000 files and the resulting file size is ~60MB. Admittedly I'm on Win7 x64 with 8 GB of RAM. That said, you might try something like this:

md c:\newdir
Get-ChildItem C:\hugedir -r | 
    Foreach -Begin {$i = $j = 0} -Process { 
        if ($i++ % 100000 -eq 0) { 
            $dest = "C:\newdir\dir$j"
            md $dest
            $j++ 
        }
        Move-Item $_ $dest 
    }

The key is to do the move in a streaming manner. That is, don't collect all the Get-ChildItem results into a single variable and then proceed. That would require all 2.8 million FileInfos to be in memory at once. Also, if you use the Name parameter on Get-ChildItem, it will output a single string per file containing the file's path relative to the base dir. Even then, perhaps this size will just overwhelm the memory available to you. And no doubt, it will take quite a while to execute. IIRC, my indexing script takes several hours.
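To make those two points concrete (the two lines below are my own illustration, not part of the script above):

# Anti-pattern for 2.8 million files: this builds one in-memory array of every
# FileInfo before a single Move-Item can run. The streaming pipeline above never
# materializes that array.
$all = Get-ChildItem C:\hugedir -r

# The -Name variant mentioned above: emits each file's path relative to C:\hugedir
# as a plain string instead of a FileInfo object.
Get-ChildItem C:\hugedir -r -Name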

If it does work, you should wind up with c:\newdir\dir0 through dir28. But then again, I haven't tested this script at all, so your mileage may vary. BTW, this approach assumes that your huge dir is a pretty flat dir.

Update: Using the Name parameter is almost twice as slow so don't use that parameter.

answered by Keith Hill


I found that Get-ChildItem is the slowest option when working with many items in a directory.

Look at the results:

Measure-Command { Get-ChildItem C:\Windows -rec | Out-Null }
TotalSeconds      : 77,3730275
Measure-Command { listdir C:\Windows | Out-Null } 
TotalSeconds      : 20,4077132
measure-command { cmd /c dir c:\windows /s /b | out-null }
TotalSeconds      : 13,8357157

(with the listdir function defined like this:

function listdir($dir) {
    $dir
    [system.io.directory]::GetFiles($dir)
    foreach ($d in [system.io.directory]::GetDirectories($dir)) {
        listdir $d
    }
}

)

With this in mind, here is what I would do: stay in PowerShell, but use a more low-level approach with .NET methods:

function DoForFirst($directory, $max, $action) {
    function go($dir, $options)
    {
        foreach ($f in [system.io.Directory]::EnumerateFiles($dir))
        {
            if ($options.Remaining -le 0) { return }
            & $action $f
            $options.Remaining--
        }
        foreach ($d in [system.io.directory]::EnumerateDirectories($dir))
        {
            if ($options.Remaining -le 0) { return }
            go $d $options
        }
    }
    go $directory (New-Object PsObject -Property @{Remaining=$max })
}
doForFirst c:\windows 100 {write-host File: $args }
# I use PsObject to avoid global variables and ref parameters.

To use the code you have to switch to the .NET 4.0 runtime -- the enumerating methods are new in .NET 4.0.
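How to do that switch is outside the scope of this answer. As a hedged note (treat the details as an assumption; it changes powershell.exe machine-wide and needs an elevated prompt), one commonly cited workaround at the time was to drop a powershell.exe.config next to powershell.exe asking for the v4 CLR, roughly like this:

# Assumed workaround circa PowerShell 2.0: request the .NET 4.0 CLR via a
# powershell.exe.config placed in $pshome. Affects every PowerShell session on
# the machine; delete the file to undo.
$config = @'
<?xml version="1.0"?>
<configuration>
  <startup useLegacyV2RuntimeActivationPolicy="true">
    <supportedRuntime version="v4.0.30319"/>
    <supportedRuntime version="v2.0.50727"/>
  </startup>
</configuration>
'@
Set-Content -Path (Join-Path $pshome 'powershell.exe.config') -Value $config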

You can specify any scriptblock as the -action parameter, so in your case it would be something like {Move-item -literalPath $args -dest c:\dir }.

Just try listing the first 1000 items; I hope it will finish very quickly:

doForFirst c:\yourdirectory 1000 {write-host '.' -nonew }

And of course you can process all items at once, just use

doForFirst c:\yourdirectory ([long]::MaxValue) {move-item ... }

and each item should be processed immediately after it is returned. So the whole list is not read first and then processed; it is processed while it is being read.
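To connect this back to the batching requirement in the question, an untested sketch reusing DoForFirst could look like the following (the 100,000-per-folder figure and the C:\hugedir / C:\newdir paths are just assumptions):

# Untested sketch: move files into a new subfolder under C:\newdir every 100,000
# files, still one file at a time as they are enumerated. $state is a hashtable,
# so the scriptblock can update its entries in place from inside DoForFirst.
$state = @{ Moved = 0; Batch = 0; Dest = $null }
md C:\newdir | Out-Null
doForFirst C:\hugedir ([long]::MaxValue) {
    if ($state.Moved % 100000 -eq 0) {
        $state.Dest = "C:\newdir\dir$($state.Batch)"
        md $state.Dest | Out-Null
        $state.Batch = $state.Batch + 1
    }
    Move-Item -LiteralPath $args -Destination $state.Dest
    $state.Moved = $state.Moved + 1
}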

answered by stej