I am filtering a large web access log file into about a dozen smaller files, depending on which regex matches. Since I have little experience with regex, I am hoping to find out how to optimise the patterns for better performance.
The source is formatted as follows:
2015-06-14 00:00:06 38.75.53.205 - HTTP 10.250.35.69 80 GET /en/process/dsa/policy/wbss/wbss_current/wbss2013/Wbss13.pdf - 206 299 16722 0 HTTP (etc)
or
2015-06-13 00:00:31 1.22.55.170 - HTTP 157.150.186.68 80 GET /esd/sddev/enable/compl.htm - 200 396 23040 0 HTTP/1.1 Mozilla (etc)
The following are a few of my regex patterns. They all look in the same area of each line, after GET. This is how I have them now:
dsq = "( /esd/sddev/| /creative/)"
dpq = "/dsa/policy/"
pop = "(^((?! /popq/ /caster/(dsa/(policy|qsc|qlation))|(esd/(fed|cdq|qaccount|sddev|creative|forums/rdev))).)*$)"
The first two are looking to match specified patterns, while "pop" is supposed to match everything BUT the specified patterns.
This works as it is, but since my log files tend to be rather large (1GB and bigger), and I have some 12 different patterns to match, I was hoping there may be a way to improve the performance of these patterns.
As for the usage, I have the following code, where $profile is one of the patterns listed above (they are in a hash table, and I loop through them separately):
Get-Content $sourcefile -ReadCount 5000 |
ForEach { $_ -match $profile | Add-Content targetfile }
Thank you all for any insight!
Not an improvement on the regex itself, but if you are running a separate pass over $sourcefile for every profile, I can offer a small solution for that.
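Get-Content $sourcefile -ReadCount 5000 | ForEach {
    # -ReadCount sends arrays of lines down the pipeline, so handle each line on its own.
    ForEach ($line in $_) {
        # Reset so a match from a previous line does not carry over.
        $chosenPath = $null
        switch -regex ($line) {
            $dsq {$chosenPath = "file1"; break}
            $dpq {$chosenPath = "file2"; break}
            $pop {$chosenPath = "file3"; break}
            default {}
        }
        # If no path is set then we skip this step.
        If($chosenPath){$line | Add-Content $chosenPath}
    }
}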
Use the -regex parameter of switch. You can reference each element of your hashtable as a match condition. When a match is found, we set the output file for that pass and stop processing the switch so that later patterns are not tested. This makes the order of the conditions significant, but since you stated that the matches are mutually exclusive, that should not be an issue.
You could rewrite this with an Add-Content in every match block, but I was trying to avoid repeating similar code. If you did go that way, you could remove the $null check on $chosenPath that I added.
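The per-branch Add-Content variant mentioned above might look like the following minimal sketch. The temporary folder, the output file names (dsq.log, dpq.log, other.log) and the third sample line are made up for illustration; the two patterns are the ones from the question. It relies on switch enumerating the array that -ReadCount produces, so each branch sees one line in $_.

```powershell
# Patterns taken from the question.
$dsq = "( /esd/sddev/| /creative/)"
$dpq = "/dsa/policy/"

# Hypothetical scratch folder and sample log, for illustration only.
$outDir = Join-Path ([System.IO.Path]::GetTempPath()) ([System.Guid]::NewGuid().ToString())
New-Item -ItemType Directory -Path $outDir | Out-Null
$sourcefile = Join-Path $outDir "access.log"
Set-Content -Path $sourcefile -Value @(
    "2015-06-14 00:00:06 38.75.53.205 - HTTP 10.250.35.69 80 GET /en/process/dsa/policy/wbss/wbss_current/wbss2013/Wbss13.pdf - 206 299 16722 0 HTTP",
    "2015-06-13 00:00:31 1.22.55.170 - HTTP 157.150.186.68 80 GET /esd/sddev/enable/compl.htm - 200 396 23040 0 HTTP/1.1 Mozilla",
    "2015-06-13 00:01:02 1.22.55.170 - HTTP 157.150.186.68 80 GET /index.htm - 200 396 1024 0 HTTP/1.1 Mozilla"
)

Get-Content $sourcefile -ReadCount 5000 | ForEach-Object {
    # $_ is an array of up to 5000 lines; switch walks it one element at a time.
    switch -Regex ($_) {
        # continue skips the remaining conditions and moves to the next line.
        $dsq    { Add-Content -Path (Join-Path $outDir "dsq.log")   -Value $_; continue }
        $dpq    { Add-Content -Path (Join-Path $outDir "dpq.log")   -Value $_; continue }
        # Lines that matched none of the patterns above.
        default { Add-Content -Path (Join-Path $outDir "other.log") -Value $_ }
    }
}
```

Whether a default branch can stand in for the pop pattern depends on the full set of patterns in your hashtable, so treat that part as an assumption rather than a drop-in replacement.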
Regex efficiency
With that last one, if you are just trying to match everything other than the listed paths for pop, why not remove the lookahead, the greedy quantifier, and the anchors, and just use -notmatch?
$pop = "/popq/ /caster/(dsa/(policy|qsc|qlation))|esd/(fed|cdq|qaccount|sddev|creative|forums/rdev)"
Get-Content $sourcefile -ReadCount 5000 |
ForEach { $_ -notmatch $pop | Add-Content targetfile }
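To check that dropping the lookahead does not change which lines are kept, something like the following sketch could be run first. The sample lines and both patterns are simplified stand-ins I made up, not your real pop pattern; the timing at the end is only there so you can compare on your own machine.

```powershell
# Hypothetical sample lines and simplified patterns, for illustration only.
$sample = @(
    "80 GET /esd/sddev/enable/compl.htm - 200",
    "80 GET /en/process/dsa/policy/Wbss13.pdf - 206",
    "80 GET /creative/index.htm - 200"
)
$exclude   = "/(esd/sddev|creative)/"             # plain pattern, used with -notmatch
$lookahead = "^((?!/(esd/sddev|creative)/).)*$"   # same exclusion written as a negative lookahead

# With an array on the left, both operators return the elements that pass.
$viaLookahead = $sample -match $lookahead
$viaNotMatch  = $sample -notmatch $exclude

# Compare-Object emits nothing when both result sets contain the same lines.
$diff = Compare-Object $viaLookahead $viaNotMatch

# Rough timing on a larger in-memory array (multiplying an array repeats it).
$big = $sample * 20000
$t1 = Measure-Command { $big -match $lookahead }
$t2 = Measure-Command { $big -notmatch $exclude }
"lookahead: {0:N0} ms, -notmatch: {1:N0} ms" -f $t1.TotalMilliseconds, $t2.TotalMilliseconds
```

If $diff stays empty on a representative slice of your real log, the two forms select the same lines and you can pick whichever measures faster.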
As a side note, I would have expected that you would need a second loop in your pipeline to break out the array of 5000 items:
Get-Content $sourcefile -ReadCount 5000 |
ForEach { $_ | ForEach { if ($_ -match $profile) { $_ } } | Add-Content targetfile }
Note that on a single line -match returns True or False rather than the line itself, hence the if. I wonder if the regex is being performed on 5000 lines at once instead of the one line you expect it to be... or maybe it's a typo... or maybe I'm nuts.
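For what it is worth, the difference between running -match on a batch and on a single line can be seen with a small sketch like this one. The sample lines are made up; the pattern is dsq from the question.

```powershell
# Hypothetical sample batch, standing in for one -ReadCount chunk.
$batch = @(
    "80 GET /esd/sddev/enable/compl.htm - 200",
    "80 GET /en/process/dsa/policy/doc.pdf - 206",
    "80 GET /creative/index.htm - 200"
)
$pattern = "( /esd/sddev/| /creative/)"

# Array on the left: -match acts as a filter and returns the matching elements.
$fromArray = $batch -match $pattern

# Single string on the left: -match returns a Boolean instead.
$fromScalar = $batch[0] -match $pattern

$fromArray                      # the lines that matched
$fromScalar.GetType().Name      # the type of the scalar result
```

So when $_ is a whole batch, piping the result of -match to Add-Content writes the matching lines, while the same expression on one line at a time would write True or False.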