Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Can a Regex Return the Number of the Line where the Match is Found?

In a text editor, I want to replace a given word with the number of the line number on which this word is found. Is this is possible with Regex?

like image 599
Omar Avatar asked May 18 '14 22:05

Omar


2 Answers

Recursion, Self-Referencing Group (Qtax trick), Reverse Qtax or Balancing Groups

Introduction

The idea of adding a list of integers to the bottom of the input is similar to a famous database hack (nothing to do with regex) where one joins to a table of integers. My original answer used the @Qtax trick. The current answers use either Recursion, the Qtax trick (straight or in a reversed variation), or Balancing Groups.

Yes, it is possible... With some caveats and regex trickery.

  1. The solutions in this answer are meant as a vehicle to demonstrate some regex syntax more than practical answers to be implemented.
  2. At the end of your file, we will paste a list of numbers preceded with a unique delimiter. For this experiment, the appended string is :1:2:3:4:5:6:7 This is a similar technique to a famous database hack that uses a table of integers.
  3. For the first two solutions, we need an editor that uses a regex flavor that allows recursion (solution 1) or self-referencing capture groups (solutions 2 and 3). Two come to mind: Notepad++ and EditPad Pro. For the third solution, we need an editor that supports balancing groups. That probably limits us to EditPad Pro or Visual Studio 2013+.

Input file:

Let's say we are searching for pig and want to replace it with the line number.

We'll use this as input:

my cat
dog
my pig
my cow
my mouse

:1:2:3:4:5:6:7

First Solution: Recursion

Supported languages: Apart from the text editors mentioned above (Notepad++ and EditPad Pro), this solution should work in languages that use PCRE (PHP, R, Delphi), in Perl, and in Python using Matthew Barnett's regex module (untested).

The recursive structure lives in a lookahead, and is optional. Its job is to balance lines that don't contain pig, on the left, with numbers, on the right: think of it as balancing a nested construct like {{{ }}}... Except that on the left we have the no-match lines, and on the right we have the numbers. The point is that when we exit the lookahead, we know how many lines were skipped.

Search:

(?sm)(?=.*?pig)(?=((?:^(?:(?!pig)[^\r\n])*(?:\r?\n))(?:(?1)|[^:]+)(:\d+))?).*?\Kpig(?=.*?(?(2)\2):(\d+))

Free-Spacing Version with Comments:

(?xsm)             # free-spacing mode, multi-line
(?=.*?pig)        # fail right away if pig isn't there

(?=               # The Recursive Structure Lives In This Lookahead
(                 # Group 1
   (?:               # skip one line 
      ^              
      (?:(?!pig)[^\r\n])*  # zero or more chars not followed by pig
      (?:\r?\n)      # newline chars
    ) 
    (?:(?1)|[^:]+)   # recurse Group 1 OR match all chars that are not a :
    (:\d+)           # match digits
)?                 # End Group 
)                 # End lookahead. 
.*?\Kpig                # get to pig
(?=.*?(?(2)\2):(\d+))   # Lookahead: capture the next digits

Replace: \3

In the demo, see the substitutions at the bottom. You can play with the letters on the first two lines (delete a space to make pig) to move the first occurrence of pig to a different line, and see how that affects the results.


Second Solution: Group that Refers to Itself ("Qtax Trick")

Supported languages: Apart from the text editors mentioned above (Notepad++ and EditPad Pro), this solution should work in languages that use PCRE (PHP, R, Delphi), in Perl, and in Python using Matthew Barnett's regex module (untested). The solution is easy to adapt to .NET by converting the \K to a lookahead and the possessive quantifier to an atomic group (see the .NET Version a few lines below.)

Search:

(?sm)(?=.*?pig)(?:(?:^(?:(?!pig)[^\r\n])*(?:\r?\n))(?=[^:]+((?(1)\1):\d+)))*+.*?\Kpig(?=[^:]+(?(1)\1):(\d+))

.NET version: Back to the Future

.NET does not have \K. It its place, we use a "back to the future" lookbehind (a lookbehind that contains a lookahead that skips ahead of the match). Also, we need to use an atomic group instead of a possessive quantifier.

(?sm)(?<=(?=.*?pig)(?=(?>(?:^(?:(?!pig)[^\r\n])*(?:\r?\n))(?=[^:]+((?(1)\1):\d+)))*).*)pig(?=[^:]+(?(1)\1):(\d+))

Free-Spacing Version with Comments (Perl / PCRE Version):

(?xsm)             # free-spacing mode, multi-line
(?=.*?pig)        # lookahead: if pig is not there, fail right away to save the effort
(?:               # start counter-line-skipper (lines that don't include pig)
   (?:               # skip one line 
      ^              # 
      (?:(?!pig)[^\r\n])*  # zero or more chars not followed by pig
      (?:\r?\n)      # newline chars
    )   
   # for each line skipped, let Group 1 match an ever increasing portion of the numbers string at the bottom
   (?=             # lookahead
      [^:]+           # skip all chars that are not colons
      (               # start Group 1
        (?(1)\1)      # match Group 1 if set
        :\d+          # match a colon and some digits
      )               # end Group 1
   )               # end lookahead
)*+               # end counter-line-skipper: zero or more times
.*?               # match
\K                # drop everything we've matched so far
pig               # match pig (this is the match!)
(?=[^:]+(?(1)\1):(\d+))   # capture the next number to Group 2

Replace:

\2

Output:

my cat
dog
my 3
my cow
my mouse

:1:2:3:4:5:6:7

In the demo, see the substitutions at the bottom. You can play with the letters on the first two lines (delete a space to make pig) to move the first occurrence of pig to a different line, and see how that affects the results.

Choice of Delimiter for Digits

In our example, the delimiter : for the string of digits is rather common, and could happen elsewhere. We can invent a UNIQUE_DELIMITER and tweak the expression slightly. But the following optimization is even more efficient and lets us keep the :


Optimization on Second Solution: Reverse String of Digits

Instead of pasting our digits in order, it may be to our benefit to use them in the reverse order: :7:6:5:4:3:2:1

In our lookaheads, this allows us to get down to the bottom of the input with a simple .*, and to start backtracking from there. Since we know we're at the end of the string, we don't have to worry about the :digits being part of another section of the string. Here's how to do it.

Input:

my cat pi g
dog p ig
my pig
my cow
my mouse

:7:6:5:4:3:2:1

Search:

(?xsm)             # free-spacing mode, multi-line
(?=.*?pig)        # lookahead: if pig is not there, fail right away to save the effort
(?:               # start counter-line-skipper (lines that don't include pig)
   (?:               # skip one line that doesn't have pig
      ^              # 
      (?:(?!pig)[^\r\n])*  # zero or more chars not followed by pig
      (?:\r?\n)      # newline chars
    )   
   # Group 1 matches increasing portion of the numbers string at the bottom
   (?=             # lookahead
      .*           # get to the end of the input
      (               # start Group 1
        :\d+          # match a colon and some digits
        (?(1)\1)      # match Group 1 if set
      )               # end Group 1
   )               # end lookahead
)*+               # end counter-line-skipper: zero or more times
.*?               # match
\K                # drop match so far
pig               # match pig (this is the match!)
(?=.*(\d+)(?(1)\1))   # capture the next number to Group 2

Replace: \2

See the substitutions in the demo.

Third Solution: Balancing Groups

This solution is specific to .NET.

Search:

(?m)(?<=\A(?<c>^(?:(?!pig)[^\r\n])*(?:\r?\n))*.*?)pig(?=[^:]+(?(c)(?<-c>:\d+)*):(\d+))

Free-Spacing Version with Comments:

(?xm)                # free-spacing, multi-line
(?<=                 # lookbehind
   \A                # 
   (?<c>               # skip one line that doesn't have pig
                       # The length of Group c Captures will serve as a counter
     ^                    # beginning of line
     (?:(?!pig)[^\r\n])*  # zero or more chars not followed by pig
     (?:\r?\n)            # newline chars
   )                   # end skipper
   *                   # repeat skipper
   .*?                 # we're on the pig line: lazily match chars before pig
   )                # end lookbehind
pig                 # match pig: this is the match
(?=                 # lookahead
   [^:]+               # get to the digits
   (?(c)               # if Group c has been set
     (?<-c>:\d+)         # decrement c while we match a group of digits
     *                   # repeat: this will only repeat as long as the length of Group c captures > 0 
   )                   # end if Group c has been set
   :(\d+)              # Match the next digit group, capture the digits
)                    # end lokahead

Replace: $1


Reference

  • Qtax trick
  • On Which Line Number Was the Regex Match Found?
like image 105
zx81 Avatar answered Sep 22 '22 14:09

zx81


Because you didn't specify which text editor, in vim it would be:

:%s/searched_word/\=printf('%-4d', line('.'))/g (read more)

But as somebody mentioned it's not a question for SO but rather Super User ;)

like image 40
Maciek Avatar answered Sep 22 '22 14:09

Maciek