I have standard Apache log files, between 500Mb and 2GB in size. I need to sort the lines in them (each line starts with a date yyyy-MM-dd hh:mm:ss, so no treatment necessary for sorting.
The simplest and most obvious thing that comes to mind is
Get-Content unsorted.txt | sort | get-unique > sorted.txt
I am guessing (without having tried it) that doing this using Get-Content
would take forever in my 1GB files. I don't quite know my way around System.IO.StreamReader
, but I'm curious if an efficient solution could be put together using that?
Thanks to anyone who might have a more efficient idea.
[edit]
I tried this subsequently, and it took a very long time; some 10 minutes for 400MB.
Get-Content
is terribly ineffective for reading large files. Sort-Object
is not very fast, too.
Let's set up a base line:
$sw = [System.Diagnostics.Stopwatch]::StartNew();
$c = Get-Content .\log3.txt -Encoding Ascii
$sw.Stop();
Write-Output ("Reading took {0}" -f $sw.Elapsed);
$sw = [System.Diagnostics.Stopwatch]::StartNew();
$s = $c | Sort-Object;
$sw.Stop();
Write-Output ("Sorting took {0}" -f $sw.Elapsed);
$sw = [System.Diagnostics.Stopwatch]::StartNew();
$u = $s | Get-Unique
$sw.Stop();
Write-Output ("uniq took {0}" -f $sw.Elapsed);
$sw = [System.Diagnostics.Stopwatch]::StartNew();
$u | Out-File 'result.txt' -Encoding ascii
$sw.Stop();
Write-Output ("saving took {0}" -f $sw.Elapsed);
With a 40 MB file having 1.6 million lines (made of 100k unique lines repeated 16 times) this script produces the following output on my machine:
Reading took 00:02:16.5768663
Sorting took 00:02:04.0416976
uniq took 00:01:41.4630661
saving took 00:00:37.1630663
Totally unimpressive: more than 6 minutes to sort tiny file. Every step can be improved a lot. Let's use StreamReader
to read file line by line into HashSet
which will remove duplicates, then copy data to List
and sort it there, then use StreamWriter
to dump results back.
$hs = new-object System.Collections.Generic.HashSet[string]
$sw = [System.Diagnostics.Stopwatch]::StartNew();
$reader = [System.IO.File]::OpenText("D:\log3.txt")
try {
while (($line = $reader.ReadLine()) -ne $null)
{
$t = $hs.Add($line)
}
}
finally {
$reader.Close()
}
$sw.Stop();
Write-Output ("read-uniq took {0}" -f $sw.Elapsed);
$sw = [System.Diagnostics.Stopwatch]::StartNew();
$ls = new-object system.collections.generic.List[string] $hs;
$ls.Sort();
$sw.Stop();
Write-Output ("sorting took {0}" -f $sw.Elapsed);
$sw = [System.Diagnostics.Stopwatch]::StartNew();
try
{
$f = New-Object System.IO.StreamWriter "d:\result2.txt";
foreach ($s in $ls)
{
$f.WriteLine($s);
}
}
finally
{
$f.Close();
}
$sw.Stop();
Write-Output ("saving took {0}" -f $sw.Elapsed);
this script produces:
read-uniq took 00:00:32.2225181
sorting took 00:00:00.2378838
saving took 00:00:01.0724802
On same input file it runs more than 10 times faster. I am still surprised though it takes 30 seconds to read file from disk.
I've grown to hate this part of windows powershell, it is a memory hog on these larger files. One trick is to read the lines [System.IO.File]::ReadLines('file.txt') | sort -u | out-file file2.txt -encoding ascii
Another trick, seriously is to just use linux.
cat file.txt | sort -u > output.txt
Linux is so insanely fast at this, it makes me wonder what the heck microsoft is thinking with this set up.
It may not be feasible in all cases, and i understand, but if you have a linux machine, you can copy 500 megs to it, sort and unique it, and copy it back in under a couple minutes.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With