We're looking at splitting a huge file (over 25GB) into chunks and importing them into our Redshift DWH. So far we simply split the file into blocks of 1,000,000 lines, but it took almost two hours to produce the 111 files.
$i=1; cat .\TRGET_FILE.csv -ReadCount 1000000 | % { $_ > TRGET_FILE_$i.csv;$i++ }
We'd also like to escape every backslash '\' as '\\' in each file so it doesn't break the Redshift data load, but doing it this way consumes a lot of both time and resources.
Could you kindly let me know if you know of any better solution? Thank you.
cat is an alias for Get-Content, which is notoriously slow at reading large files. I also suspect that using the redirection operator may not be the best way to write the target files either.
Let's try using some .Net classes and methods instead:
$bigfile = '.\TRGET_FILE.csv'
$outfile = '.\TRGET_FILE{0:d5}.csv'
$linecount = 1000000
$i = 0
$outstream = $null

foreach ($line in [System.IO.File]::ReadLines($bigfile)) {
    if (($i % $linecount) -eq 0) {
        if ($null -ne $outstream) { $outstream.Close() }
        $outstream = [System.IO.StreamWriter]::new(($outfile -f ($i / $linecount)))
    }
    $outstream.WriteLine($line.Replace('\', '\\'))
    $i++
}
$outstream.Close()
The [System.IO.File]::ReadLines method is very fast, and we can iterate over it with a foreach loop without reading the whole file into memory first, because it streams the lines one at a time.
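If you want a rough feel for the difference on your own hardware, you can time both readers on a smaller sample first. This is just a sketch: sample.csv is a placeholder name, and the .NET call resolves relative paths against the process working directory, so an absolute path is passed to it here.

# Rough timing comparison on a (hypothetical) smaller sample file.
$sample = (Resolve-Path '.\sample.csv').Path   # absolute path for the .NET method

Measure-Command { Get-Content $sample | Out-Null }                                # cmdlet-based read
Measure-Command { foreach ($line in [System.IO.File]::ReadLines($sample)) { } }   # streaming .NET read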
Since each iteration of the loop processes a single line, we'll write it out to the target file. But we don't want to use something like Out-File -Append because it will reopen and close the file every time.
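For contrast, the per-line append pattern we're avoiding would look roughly like this (a hypothetical sketch with a made-up output name; don't run it on the full 25GB file):

# Slow pattern: every Out-File -Append call opens and closes the output file again,
# which adds up to millions of open/close operations on a file this size.
foreach ($line in [System.IO.File]::ReadLines($bigfile)) {
    $line | Out-File -FilePath '.\chunk_slow.csv' -Append
}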
So instead, when we calculate that we've reached the desired number of lines (or on the first iteration), we open a new file with a [System.IO.StreamWriter]; that way the file is already open and we can keep writing lines to it.
When we write the line, we also replace a single backslash \ character with two backslashes.
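As a quick illustration of what that Replace call does (the sample value is made up; String.Replace treats both arguments literally, so no regex escaping is involved):

# Each single backslash is doubled; everything else is left untouched.
'one\two,three'.Replace('\', '\\')   # outputs: one\\two,three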
When we reach the desired number of lines, we first check whether $outstream is $null (it will be the very first time through); if it's not, we close the stream (which closes the current output file), and then we create a new one with the next file name.
The output file name is templated with a 5-digit number via the -f string format operator.
{0:d5} -- 0 means the first item passed to the format operator, d stands for a decimal (integer) number, and 5 tells it to pad the number with leading zeroes to at least 5 digits. That helps keep the files in order when sorted by name.
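For example, with the template defined above (outputs shown as comments):

# The -f operator substitutes the value into the {0:d5} placeholder and zero-pads it.
'.\TRGET_FILE{0:d5}.csv' -f 3     # .\TRGET_FILE00003.csv
'.\TRGET_FILE{0:d5}.csv' -f 110   # .\TRGET_FILE00110.csv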
Finally, after the loop ends (the last chunk will normally stop short of the desired number of lines, since it's the tail of the file), we close that last stream.
Good luck!
Can you try to limit the chunk size to 1M ~ 2M by reducing the -ReadCount? Something like this:
$i = 1; $j = 0
Get-Content .\TRGET_FILE.csv -ReadCount 10000 |
    ForEach-Object {
        $_ | Add-Content -Path TRGET_FILE_$i.csv
        $j++
        if ($j -eq 100) {    # 100 batches of 10000 lines = 1,000,000 lines per file
            $j = 0; $i++
        }
    }
If this works better, you can apply the replacement like this:
$i = 1; $j = 0
Get-Content .\TRGET_FILE.csv -ReadCount 10000 |
    ForEach-Object {
        $_.Replace('\', '\\') | Add-Content -Path TRGET_FILE_$i.csv
        $j++
        if ($j -eq 100) {
            $j = 0; $i++
        }
    }
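One thing to be aware of: $_ here is an array of up to 10000 lines, and calling .Replace on it works because PowerShell 3.0 and later apply the method to every element via member enumeration. A tiny sketch with made-up values:

# Member enumeration calls String.Replace on each element and returns a collection of results.
('a\b', 'c\d').Replace('\', '\\')   # outputs: a\\b and c\\d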
A lot of this depends on disk I/O, so a disk that is very fast at reading and writing blocks will help.