
Get Line number of Text.RegularExpressions.Regex matches

I use PowerShell to parse a directory of log files and extract all XML entries from them. This works pretty well. However, since a log file can contain many of these XML fragments, I also want to put the line number of each match into the filename of the XML file I write, so that I can open the log file and jump to that specific line to do some root cause analysis.

There is a field "Index", which I think is the character offset, and this should probably lead me to the line number. However, I think "Index" somehow counts something different from Measure-Object -Character, since the value of Index is greater than the size found with Measure-Object -Character. E.g. $m.Groups[0].Captures[0].Index is 9963166, but Measure-Object -Character over all files in the log dir gives 9838833 as the maximum, so I think it also counts newlines.

So the question is probably: if Matches() gives me "Index" as a property, how do I know how many newlines precede that index? Do I have to read "Index" characters from the file, count how many newlines they contain, and THEN I have the line? Probably.

$tag = 'data_json'
$input_dir = $absolute_root_dir + $specific_dir
$output_dir = $input_dir  + 'ParsedDataFiles\'
$OFS = "`r`n"
$nice_specific_dir = $specific_dir.Replace('\','_')
$nice_specific_dir = $nice_specific_dir.Replace(':','_')
$regex = New-Object Text.RegularExpressions.Regex "<$tag>(.+?)<\/$tag>", ('singleline', 'multiline')
New-Item -ItemType Directory -Force -Path $output_dir
Get-ChildItem -Path $input_dir -Name -File | % {   
    $output_file = $output_dir + $nice_specific_dir + $_ + '.'
    $content = Get-Content ($input_dir + $_)
    $i = 0
    foreach($m in $regex.Matches($content)) {        
        $outputfile_xml = $output_file + $i++ + '.xml'
        $outputfile_txt = $output_file + $i++ + '.txt'
        $xml = [xml] ("<" + $tag+ ">" + $m.Groups[1].Value + "</" + $tag + ">")
        $xml.Save($outputfile_xml)
        $j = 0
        $xml.data_json.Messages.source.item | % { $_.SortOrder + ", " + $_.StartOn + ", " + $_.EndOn + ", " + $_.Id } | sort | %  { 
            (($j++).ToString() + ", " + $_ )   | Out-File $outputfile_txt -Append
        }
    }
}
edelwater asked Mar 04 '23 11:03


1 Answer

Note: If what your regular expression matches is guaranteed to never span multiple lines, i.e. if the matched text is guaranteed to be on a single line, consider a simpler Select-String-based solution as shown in js2010's answer; generally, though, a method/expression-based solution, as in this answer, will perform better.


Your first problem is that you use Get-Content without -Raw, which reads the input file as an array of lines rather than a single, multi-line string.

When you pass this array to $regex.Matches(), PowerShell stringifies the array by joining the elements with spaces (by default), so the original newlines are no longer present in the string being matched.

Therefore, read your input file with Get-Content -Raw, which ensures that it is read as a single, multi-line string with newlines intact:

# Read entire file as single string
$content = Get-Content -Raw ($input_dir + $_)
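
To make the difference concrete, here is a minimal sketch that reads the same file both ways. The temporary file and the variable names ($tmp, $asLines, $asText) are assumptions made up for this illustration, and the space-joined result assumes $OFS has not been set in the session.

```powershell
# Hypothetical temp file, created only for this illustration.
$tmp = [IO.Path]::GetTempFileName()
Set-Content -LiteralPath $tmp -Value 'line one', 'line two', 'line three'

# Without -Raw: an array of strings, one per line, newlines stripped.
$asLines = Get-Content -LiteralPath $tmp
# With -Raw: one string, newlines intact.
$asText = Get-Content -LiteralPath $tmp -Raw

"Without -Raw: $($asLines.Count) separate strings"
"Stringified array: '$([string] $asLines)'"   # elements joined with spaces if $OFS is unset
$newlineCount = [regex]::Matches($asText, '\n').Count
"With -Raw: one string containing $newlineCount newlines"

Remove-Item -LiteralPath $tmp
```

Only the -Raw variant still contains the newline characters that a line count can be derived from.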

Once you match against a multi-line string, you can infer the line number by counting the number of lines in the substring up to the character index at which each match was found, via .Substring() and Measure-Object -Line.

Here's a simplified, self-contained example (see the bottom section if you also want to determine the column number):

# Sample multi-line input.
# Note: The <title> elements are at lines 3 and 6, respectively.
$content = @'
<catalog>
  <book id="bk101">
    <title>De Profundis</title>
  </book>
  <book id="bk102">
    <title>Pygmalion</title>
  </book>
</catalog>
'@

# Regex that finds all <title> elements.
# Inline option ('(?...)') 's' makes '.' match newlines too
$regex = [regex] '(?s)<title>.+?</title>'

foreach ($m in $regex.Matches($content)) {
  $lineNumber = ($content.Substring(0, $m.Index + 1) | Measure-Object -Line).Lines
  "Found '$($m.Value)' at index $($m.Index), line $lineNumber"
}

Note the + 1 in $m.Index + 1, which is needed to ensure that the substring doesn't end in a newline character, because Measure-Object -Line would disregard such a trailing newline. By including at least one additional (non-newline) character, the < of the matched element, the line count is always correct, even if the matched element starts at the very first column.
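
The trailing-newline behavior described above can be checked in isolation. This is a small sketch with made-up sample strings ($endsInNewline, $oneMoreChar), not part of the original solution:

```powershell
# Two strings that differ only in one extra character after the final newline.
$endsInNewline = "one`ntwo`n"
$oneMoreChar   = "one`ntwo`n<"

$countA = ($endsInNewline | Measure-Object -Line).Lines
$countB = ($oneMoreChar   | Measure-Object -Line).Lines

"Ends in newline:     $countA line(s)"
"One extra character: $countB line(s)"
```

The second count is one higher than the first, which is why the substring has to reach one character into the match.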

The above yields:

Found '<title>De Profundis</title>' at index 34, line 3
Found '<title>Pygmalion</title>' at index 96, line 6

In case you want to also get the column number (the 1-based index of the character that starts the match on the line it was found):

Determining the line and column numbers of regex matches in multi-line strings:

# Sample multi-line input.
# Note: The <title> elements are at lines 3 and 6, columns 5 and 7, respectively.
$content = @'
<catalog>
  <book id="bk101">
    <title>De Profundis</title>
  </book>
  <book id="bk102">
      <title>Pygmalion</title>
  </book>
</catalog>
'@

# Regex that finds all <title> elements, along with the
# string that precedes them on the same line:
# Due to use of capture groups, each match $m will contain:
#  * the matched element: $m.Groups[2].Value
#  * the preceding string on the same line: $m.Groups[1].Value
# Inline options ('(?...)'):
#   * 's' makes '.' match newlines too
#   * 'm' makes '^' and '$' match the starts and ends of *individual lines*
$regex = [regex] '(?sm)(^[^\n]*)(<title>.+?</title>)'

foreach ($m in $regex.Matches($content)) {
  $lineNumber = ($content.Substring(0, $m.Index + 1) | Measure-Object -Line).Lines
  $columnNumber = 1 + $m.Groups[1].Value.Length
  "Found '$($m.Groups[2].Value)' at line $lineNumber, column $columnNumber."
}

The above yields:

Found '<title>De Profundis</title>' at line 3, column 5.
Found '<title>Pygmalion</title>' at line 6, column 7.
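
As a possible alternative to the prefix capture group, the column can also be derived from the position of the last newline before the match, using String.LastIndexOf(). This is only a sketch of that idea, not the approach used above; the $positions and $lastNewline names are made up for the illustration.

```powershell
# Same sample input as above: <title> elements at lines 3 and 6, columns 5 and 7.
$content = @'
<catalog>
  <book id="bk101">
    <title>De Profundis</title>
  </book>
  <book id="bk102">
      <title>Pygmalion</title>
  </book>
</catalog>
'@

$regex = [regex] '(?s)<title>.+?</title>'

$positions = foreach ($m in $regex.Matches($content)) {
  $lineNumber = ($content.Substring(0, $m.Index + 1) | Measure-Object -Line).Lines
  # Search backward from the match for the nearest newline;
  # LastIndexOf() returns -1 if the match is on the first line.
  $lastNewline = $content.LastIndexOf([char] "`n", $m.Index)
  $columnNumber = $m.Index - $lastNewline
  [pscustomobject] @{ Line = $lineNumber; Column = $columnNumber; Text = $m.Value }
}
$positions
```

Because only the position of the "`n" character matters, the calculation works the same way for LF and CRLF input.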

Note: For simplicity, both solutions above count the lines from the start of the string in every iteration.
In most cases this will likely still perform well enough; if not, see the variant approach in the performance benchmarks below, where the line count is calculated iteratively, with only the lines between the current and the previous match getting counted in a given iteration.
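
For illustration, here is one way the iterative idea could look when applied to the first sample. Note that this sketch counts "`n" characters with a regex rather than using Measure-Object -Line, which avoids the + 1 adjustment; the $scanFrom and $results names are assumptions made for this example.

```powershell
$content = @'
<catalog>
  <book id="bk101">
    <title>De Profundis</title>
  </book>
  <book id="bk102">
    <title>Pygmalion</title>
  </book>
</catalog>
'@

$regex = [regex] '(?s)<title>.+?</title>'

$lineNumber = 1   # line on which the current scan position sits
$scanFrom   = 0   # index up to which newlines have already been counted
$results = foreach ($m in $regex.Matches($content)) {
  # Count only the newlines between the previous scan position and this match.
  $gap = $content.Substring($scanFrom, $m.Index - $scanFrom)
  $lineNumber += [regex]::Matches($gap, '\n').Count
  $scanFrom = $m.Index
  [pscustomobject] @{ Line = $lineNumber; Text = $m.Value }
}
$results
```

Since $scanFrom is set to the start of each match rather than its end, newlines inside a match that spans several lines are still counted for the next match.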


Optional reading: performance comparison of approaches to line counting:

sln's answer proposes using regular expressions also for line counting.

Comparing these approaches as well as the .Substring() plus Measure-Object -Line approach above in terms of performance may be of interest.

The following tests are based on the Time-Command function.

Sample results are from PowerShell Core 7.0.0-preview.3 on macOS 10.14.6, averaged over 100 runs; the absolute numbers will vary depending on the execution environment, but the relative ranking of the approaches (Factor column) seems to be similar across platforms and PowerShell editions:

  • With 1,000 lines and 1 match on the last line:
Factor Secs (100-run avg.) Command
------ ------------------- -------
1.00   0.001               # .Substring() + Measure-Object -Line, count iteratively…
1.07   0.001               # .Substring() + Measure-Object -Line, count from start…
2.22   0.002               # Repeating capture with nested newline capturing…
6.12   0.006               # Prefix capture group + Measure-Object -Line…
6.72   0.007               # Prefix capture group + newline-matching regex…
7.24   0.007               # Prefix Capture group + -split…
  • With 20,000 lines and 20 evenly spaced matches starting at line 1,000:
Factor Secs (100-run avg.) Command
------ ------------------- -------
1.00   0.014               # .Substring() + Measure-Object -Line, count iteratively…
2.92   0.042               # Repeating capture with nested newline capturing…
7.50   0.107               # .Substring() + Measure-Object -Line, count from start…
8.39   0.119               # Prefix capture group + Measure-Object -Line…
9.50   0.135               # Prefix capture group + newline-matching regex…
9.94   0.141               # Prefix Capture group + -split…

Notes and conclusions:

  • Prefix capture group refers to (variations of) "Way1" from sln's answer, whereas Repeating capture group ... refers to "Way2".

    • Note: For Way2, an (adaptation of) regex (?:.*(\r?\n))*?.*?(match_me) is used below, which is a much-improved version sln added later in a comment, whereas the version still being shown in the body of their answer (as of this writing) - ^(?:.*((?:\r?\n)?))*?(match_me) - wouldn't work for processing multiple matches in a loop.
  • The .Substring() + Measure-Object -Line approach from this answer is the fastest in all cases, but, with many matches to loop over, only if an iterative, between-matches line count is performed (.Substring() + Measure-Object -Line, count iteratively…), whereas the solutions above use a count-lines-from-the-start-for-every-match approach for simplicity (# .Substring() + Measure-Object -Line, count from start…).

  • With the Way1 approach (Prefix capture group), the specific method used for counting the newlines in the prefix match makes relatively little difference, though Measure-Object -Line is the fastest there too.

Here's the source code for the test; it's easy to experiment with match count, total number of input lines, ... by modifying the various variables near the bottom:

# The script blocks with the various approaches.
$sbs =
  { # .Substring() + Measure-Object -Line, count from start
    foreach ($m in [regex]::Matches($txt, 'found')) {
      # !! Measure-Object -Line ignores a trailing \n, so if the match is at the
      # !! start of a line, we need to include at least 1 additional character for the line to register.
      $lineNo = ($txt.Substring(0, $m.Index + 1) | Measure-Object -Line).Lines
      "Found at line $lineNo (substring() + Measure-Object -Line, counted from start every time)."
    }
  },
  { # .Substring() + Measure-Object -Line, count iteratively
    $lineNo = 0; $startNdx = 0
    foreach ($m in [regex]::Matches($txt, 'found')) {
      # !! Measure-Object -Line ignores a trailing \n, so if the match is at the
      # !! start of a line, we need to include at least 1 additional character for the line to register.
      $lineNo += ($txt.Substring($startNdx, $m.Index + 1 - $startNdx) | Measure-Object -Line).Lines
      "Found at line $lineNo (substring() + Measure-Object -Line, counted iteratively)."
      $startNdx = $m.Index + $m.Value.Length
      # !! Decrement the cumulative line number to compensate for counting starting at 1 also for subsequent matches.
      --$lineNo
    }
  },
  { # Prefix capture group + Measure-Object -Line
    $lineNo = 0
    foreach ($m in [regex]::Matches($txt, '(?s)(.*?)found')) {
      # !! Measure-Object -Line ignores a trailing \n, so if the match is at the
      # !! start of a line, we need to include at least 1 additional character for the line to register.
      $lineNo += ($m.Groups[1].Value + '.' | Measure-Object -Line).Lines
      "Found at line $lineNo (prefix capture group + Measure-Object -Line)."
      # !! Decrement the cumulative line number to compensate for counting starting at 1 also for subsequent matches.
      --$lineNo
    }
  },
  { # Prefix capture group + newline-matching regex
    $lineNo = 0
    foreach ($m in [regex]::Matches($txt, '(?s)(.*?)found')) {
      $lineNo += 1 + [regex]::Matches($m.Groups[1].Value, '\r?\n').Count
      "Found at line $lineNo (prefix capture group + newline-matching regex)."
      # !! Decrement the cumulative line number to compensate for counting starting at 1 also for subsequent matches.
      --$lineNo
    }
  },
  { # Prefix Capture group + -split
    $lineNo = 0
    foreach ($m in [regex]::Matches($txt, '(?s)(.*?)found')) {
      $lineNo += ($m.Groups[1].Value -split '\r?\n').Count
      "Found at line $lineNo (prefix capture group + -split for counting)."
      # !! Decrement the cumulative line number to compensate for counting starting at 1 also for subsequent matches.
      --$lineNo
    }
  },
  { # Repeating capture with nested newline capturing
    $lineNo = 0
    foreach ($m in [regex]::Matches($txt, '(?:.*(\r?\n))*?.*?found')) {
      $lineNo += 1 + $m.Groups[1].Captures.Count
      "Found at line $lineNo (repeating prefix capture group with newline capture)."
      # !! Decrement the cumulative line number to compensate for counting starting at 1 also for subsequent matches.
      --$lineNo
    }
  }

# Set this to 1 for debugging:
#   * runs the script blocks only once
#   * with 3 matching strings in the input.
#   * shows output so that the expected functionality (number of matches, line numbers) can be verified.
$debug = 0

$matchCount = if ($debug) { 3 } else {
  20 # Set how many matching strings should be present in the input string.
}

# Sample input:
# Create N lines that are 60 chars. wide, with the string to find on the last line...
$n = 1e3 # Set the number of lines per match.
$txt = ((1..($n-1)).foreach('ToString', '0' * 60) -join "`n") + "`n  found`n"
# ...and multiply the original string according to how many matches should be present.
$txt = $txt * $matchCount

$runsToAverage = if ($debug) { 1 } else {
  100   # Set how many test runs to report average timing for.
}
$showOutput = [bool] $debug

# Run the tests.
Time-Command -Count $runsToAverage -OutputToHost:$showOutput $sbs
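
Time-Command is not a built-in command, so it has to be defined before the test above will run. If it isn't available, a rough stand-in can be put together with the built-in Measure-Command cmdlet. The sketch below is only an approximation: the Measure-AverageSeconds helper is hypothetical, and it reports just an average per script block, without the ranking table.

```powershell
# Hypothetical helper (not part of the original benchmark): average the
# wall-clock time of a script block over several runs with Measure-Command.
function Measure-AverageSeconds {
  param(
    [Parameter(Mandatory)] [scriptblock] $ScriptBlock,
    [int] $Count = 10
  )
  $total = 0.0
  foreach ($i in 1..$Count) {
    $total += (Measure-Command -Expression $ScriptBlock).TotalSeconds
  }
  $total / $Count
}

# Small sample input: 999 filler lines, then one line containing the match.
$txt = ("0`n" * 999) + "  found`n"

$avg = Measure-AverageSeconds -Count 5 -ScriptBlock {
  foreach ($m in [regex]::Matches($txt, 'found')) {
    ($txt.Substring(0, $m.Index + 1) | Measure-Object -Line).Lines
  }
}
"{0:N4} s on average" -f $avg
```

Measure-Command discards the output of the script block it times, so this helper cannot be used to verify the reported line numbers.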
mklement0 answered Mar 11 '23 19:03