Filtering a log file quickly

I'm trying to write a PowerShell script to do some analysis on production log files.

I need to filter roughly 2.5 million lines of text by roughly 50-100 values - potentially 250,000,000 iterations if O(nm).

I've tried Get-Content | Select-String but this seems incredibly slow.

Is there any way to approach this without iterating over every line once for each value?

EDIT

So, the log files look a bit like this (datetime : process_id : log_level : message)

2016-01-30 14:01:22.349 [ 27]  INFO XXX YYY XXXFX
2016-01-30 14:01:28.146 [ 16]  INFO XXXD YY Z YYY XXXX
2016-01-30 14:01:28.162 [ 16]  DEBUG YY XXXXX YY XX P YYY
2016-01-30 14:01:28.165 [ 16]  DEBUG YY XXXXX YY XX YYY
2016-01-30 14:01:28.167 [ 16]  DEBUG YY XXXXX YY XX YYY
2016-01-30 14:01:28.912 [ 27]  INFO XXX YY XXGXXX YYYYYY YY XX

and I may be looking for the values D, F, G and Z.

The values could be strings of binary digits, hexadecimal digits, combinations of the two, strings of regular text and punctuation, or pipe-delimited values.

Alex McMillan, asked Jun 23 '26 19:06


2 Answers

Rules of thumb:

  • StreamReader is faster than Get-Content, which in turn is faster than Import-Csv.
  • A plain string operation is faster than a wildcard match, which in turn is faster than a regular expression match.
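
These rules of thumb are easy to check on your own machine. The following is only a rough sketch: the sample line, the filter value and the run count are made up for illustration, and the absolute numbers will vary with PowerShell version and hardware.

```powershell
# Hypothetical sample data; substitute a real log line and filter value.
$line   = '2016-01-30 14:01:28.146 [ 16]  INFO XXXD YY Z YYY XXXX'
$needle = 'XXXD'
$regex  = [regex]::Escape($needle)
$runs   = 100000

# Time the same containment test with three different techniques.
$timings = [ordered]@{
  'String.Contains' = Measure-Command { for ($i = 0; $i -lt $runs; $i++) { $null = $line.Contains($needle) } }
  '-like wildcard'  = Measure-Command { for ($i = 0; $i -lt $runs; $i++) { $null = $line -like "*$needle*" } }
  '-match regex'    = Measure-Command { for ($i = 0; $i -lt $runs; $i++) { $null = $line -match $regex } }
}

# Print the elapsed milliseconds per technique.
$timings.GetEnumerator() | ForEach-Object {
  '{0,-16} {1,8:N0} ms' -f $_.Key, $_.Value.TotalMilliseconds
}
```

Keep in mind that `Contains()` is case-sensitive while `-like` and `-match` are case-insensitive by default, so the three tests are not strictly equivalent.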

You probably want something like this if it's sufficient to check whether your log lines contain any of the given strings:

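$reader  = [IO.StreamReader]'C:\path\to\your.log'
$filters = 'foo', 'bar'  # ...

# ReadLine() returns $null once the end of the file is reached
while ($null -ne ($line = $reader.ReadLine())) {
  # emit the line if it contains at least one filter string (case-sensitive)
  if ($filters | Where-Object {$line.Contains($_)}) {
    $line
  }
}

# Close() also disposes the reader
$reader.Close()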

If you want to write the matches to a file with a StreamWriter instead of just echoing them, adjust the code like this:

$reader  = [IO.StreamReader]'C:\path\to\your.log'
$writer  = [IO.StreamWriter]'C:\path\to\output.txt'
$filters = 'foo', 'bar'  # ...

# ReadLine() returns $null once the end of the file is reached
while ($null -ne ($line = $reader.ReadLine())) {
  if ($filters | Where-Object {$line.Contains($_)}) {
    $writer.WriteLine($line)
  }
}

# Close() flushes the writer and disposes both objects
$reader.Close()
$writer.Close()

Depending on the structure of your log lines, as well as on the filter values and how they need to be applied, the filter logic may need adjustments. You would need to show the actual log format and some filter examples for that, though.
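
With the format from the question's edit, one such adjustment could be to test only the message part of each line. Otherwise a value like D or G would also match the log level DEBUG, and numeric or hexadecimal values could match the timestamp or process id. The following is a minimal sketch under that assumption; the function name `Test-LogLine` and the layout regex are hypothetical and would need to be adapted to the real files.

```powershell
# Assumed layout: "<date> <time> [<pid>]  <LEVEL> <message>"
$layout = [regex]'^\S+ \S+ \[\s*\d+\]\s+\w+ (?<msg>.*)$'

# Hypothetical helper: $true if the message part of a log line contains
# any of the filter strings (case-sensitive comparison).
function Test-LogLine([string]$Line, [string[]]$Filters) {
  $m = $layout.Match($Line)
  if (-not $m.Success) { return $false }     # skip lines that don't fit the layout
  $msg = $m.Groups['msg'].Value
  foreach ($f in $Filters) {
    if ($msg.Contains($f)) { return $true }  # stop at the first hit
  }
  return $false
}

$filters = 'D', 'F', 'G', 'Z'
$sample  = '2016-01-30 14:01:28.146 [ 16]  INFO XXXD YY Z YYY XXXX',
           '2016-01-30 14:01:28.165 [ 16]  DEBUG YY XXXXX YY XX YYY'

# Emits only the lines whose message contains one of the values.
$sample | Where-Object { Test-LogLine $_ $filters }
```

The plain `foreach` with an early `return` also avoids setting up a `Where-Object` pipeline for every line, which may help when the loop runs millions of times.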

Ansgar Wiechers, answered Jun 25 '26 13:06


I would try a StreamReader and a StreamWriter to speed up reading and writing, as Get-Content is slow for big files. I would also try to build a single regex (word OR word OR word, etc.) so that each line is tested once instead of once per value. For example:

$words = "foo","bar","donkey"
# Build one alternation pattern so each line is tested only once;
# Escape() makes regex metacharacters such as | or . match literally
$regex = ($words | ForEach-Object { [regex]::Escape($_) }) -join '|'

$reader = New-Object System.IO.StreamReader -ArgumentList "c:\myinputfile.txt"
$writer = New-Object System.IO.StreamWriter -ArgumentList "c:\myOUTputfile.txt"

# ReadLine() returns $null at the end of the file
while ($null -ne ($line = $reader.ReadLine())) {
    # -match is case-insensitive; use -cmatch for a case-sensitive test
    if ($line -match $regex) { $writer.WriteLine($line) }
}

# Close writer (flushes buffered output and disposes it)
$writer.Close()

# Close reader
$reader.Close()
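
A possible variation is to build a `[regex]` object once and call `IsMatch()` on it, which makes the matching options explicit. This is only a sketch: whether the `Compiled` option pays off depends on the pattern and the file size, and the sample lines below are made up for illustration.

```powershell
$words   = 'foo', 'bar', 'donkey'
$pattern = ($words | ForEach-Object { [regex]::Escape($_) }) -join '|'

# Compiled trades start-up cost for faster matching on large inputs.
# IgnoreCase mirrors the default behaviour of -match; drop it if the
# values must match case-sensitively.
$options = [System.Text.RegularExpressions.RegexOptions]'Compiled, IgnoreCase'
$regex   = New-Object System.Text.RegularExpressions.Regex -ArgumentList $pattern, $options

# Hypothetical sample lines standing in for $reader.ReadLine() results.
$lines = 'a Foo line', 'nothing here', 'my donkey'
$lines | Where-Object { $regex.IsMatch($_) }
```

Inside the read loop, the condition would then become `if ($regex.IsMatch($line)) { ... }`.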

Frode F., answered Jun 25 '26 14:06


