
Capture a keypair split & neighbored by fixed values

Tags: regex, grep, awk

Using a regex in grep, how would I capture the values of key and id (excluding the quotes) in:

<a href="/stats.php?key=string" id="92340" class=""


The best I got with lookbehind and lookahead is (?<=<a href="\/stats\.php\?key=).*(?=" class="") but that returns string" id="92340

Ideally the output would be the two values as a pair: string 92340

Any help is much appreciated.

ClassyUnderexposure asked Feb 06 '26 13:02


2 Answers

With your shown samples and attempts, please try the following GNU grep command. The -o option prints only the matched part of each line, and -P enables PCRE regular expressions.

grep -oP '^<a href="/stats\.php\?key=[^"]*" id="\K\d+'  Input_file

Explanation: a detailed breakdown of the regex used.

^                       ##Anchor the match at the start of the line.
<a href="/stats\.php    ##Match the literal text <a href="/stats.php; the dot is escaped so it is taken literally.
\?key=                  ##Match a literal ? followed by key=.
[^"]*                   ##Match zero or more characters that are not ", i.e. the key value up to (but not including) the next ".
" id="                  ##Match the literal text " id=".
\K                      ##\K is a PCRE feature (not a grep option): the text matched so far is still required, but it is dropped from the reported match.
\d+                     ##Match one or more digits.
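
To check the command against the sample line from the question, here is a minimal sketch. It assumes GNU grep built with PCRE support; the variable names line, id and key are only for illustration, and the second grep call (for the key) is an assumed variant of the same idea, not part of the command above.

```shell
# Sample line taken from the question.
line='<a href="/stats.php?key=string" id="92340" class=""'

# -o prints only the matched part; -P enables PCRE so that \K and \d work.
id=$(printf '%s\n' "$line" | grep -oP '^<a href="/stats\.php\?key=[^"]*" id="\K\d+')

# Same idea for the key: \K drops the prefix, the lookahead checks what follows.
key=$(printf '%s\n' "$line" | grep -oP '^<a href="/stats\.php\?key=\K[^"]*(?=" id=")')

echo "$key $id"   # prints: string 92340
```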

To get both the key (string) and the id, Perl is a better fit, since its capture groups can be reused in the replacement:

perl -pe 's|^<a href="/stats\.php\?key=([^"]*)" id="(\d+).*$|$1 $2|g' Input_file
RavinderSingh13 answered Feb 09 '26 11:02


If you want 2 separate matches with grep, you can use the \G anchor and make use of \K to forget what is matched so far:

grep -oP '(?:<a href="/stats\.php\?key|\G(?!^))[^=<>]*="?\K[^<>"]+' file

Or matching key and id explicitly:

grep -oP '(?:<a href="/stats\.php\?|\G(?!^))(?:key=|"\h+id=")\K[^"]+' file

Output

string
92340
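
Since the question asks for both values on one line, the two matches can be joined afterwards. A minimal sketch, assuming every matching input line yields exactly two matches (key first, then id) and that GNU grep with PCRE support is available; the variable names line and pair are only for illustration:

```shell
line='<a href="/stats.php?key=string" id="92340" class=""'

# grep emits one match per line; paste joins each pair of lines with a space.
pair=$(printf '%s\n' "$line" |
  grep -oP '(?:<a href="/stats\.php\?|\G(?!^))(?:key=|"\h+id=")\K[^"]+' |
  paste -d' ' - -)

echo "$pair"   # prints: string 92340
```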

Or using GNU awk, whose match() function accepts a third argument (an array) that receives the 2 capture groups:

awk 'match($0, /<a href="\/stats\.php\?key=([^"]*)" id="([^"]*)"/, a) {print a[1], a[2]}' file

Output

string 92340
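
The three-argument match() is a GNU awk extension. Where only a POSIX awk is available, a rough alternative is to split the line on double quotes instead. This is a sketch under the assumption that the attributes always appear in the order shown in the question (href first, then id); the variable names line and pair are hypothetical.

```shell
line='<a href="/stats.php?key=string" id="92340" class=""'

# Split on double quotes: $2 is /stats.php?key=string and $4 is the id value.
pair=$(printf '%s\n' "$line" |
  awk -F'"' '$2 ~ /^\/stats\.php\?key=/ { sub(/^.*key=/, "", $2); print $2, $4 }')

echo "$pair"   # prints: string 92340
```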

But if you are free to choose a tool, you could use a dedicated HTML / XML parser, since regexes are not aware of the document's structure.

The fourth bird answered Feb 09 '26 12:02


