
Regex match up to the LAST occurrence of a pattern (e.g. </div>) BEFORE another matching pattern (e.g. </div-container>)

In other words, there can be no other occurrence of the pattern between the end of the match and the second pattern. This needs to be implemented in a single regular expression.

In my specific case I have a page of HTML and need to extract all the content between

<w-block-content><span><div>

and

</div></span></w-block-content>

where

  • the elements might have attributes
  • the HTML might be formatted or not - there might be extra white space and newlines
  • there may be other content between any of the above tags, including inner div elements within the above outer div. But you can assume each <w-block-content> element
    • contains ONLY ONE direct <span> child (though it may contain other non-span children)
      • which contains ONLY ONE direct <div> child
        • which wraps the content that must be extracted
  • 🚩 the match must extend all the way to the last </div> within the <span> within the <w-block-content>, even if it is unmatched with an opening <div>.
  • the solution must be pure ECMAScript-spec Regex. No Javascript code can be used

Thus the problem stated in the question at the top.

The following regex successfully matches as long as there are NO internal </div> tags:

(?:<w-block-content.*[\s\S]*?<div>)([\s\S]*?)(?:<\/div>[\s\S]*?<\/span>[\s\S]*?<\/w-block-content>)

❌ But if there are additional </div> tags, the match ends prematurely, not including the entirety of the block.

I use [\s\S]*? to match against arbitrary content, including extra whitespace and newlines.
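
The premature end can be reproduced on a much smaller input. The snippet below is only a sketch: the regex is the one above, unchanged, while the variable names and the shortened sample string are made up for illustration.

```javascript
// Regex from the question, unchanged.
const questionRegex = /(?:<w-block-content.*[\s\S]*?<div>)([\s\S]*?)(?:<\/div>[\s\S]*?<\/span>[\s\S]*?<\/w-block-content>)/;

// Hypothetical, shortened input with one inner <div>...</div> pair.
const nestedSample = [
  '<w-block-content id="x"><span class="tip">',
  '  <div>',
  'before<div><b>inner</b></div>after',
  '  </div>',
  '</span></w-block-content>',
].join('\n');

const prematureMatch = questionRegex.exec(nestedSample);
const captured = prematureMatch[1];

// The lazy capture stops at the first </div>, so "after" is lost.
console.log(JSON.stringify(captured)); // "\nbefore<div><b>inner</b>"
console.log(captured.includes('after')); // false
```

The wanted capture would run through `after` and the whitespace that follows it, up to the last `</div>` before `</span>`.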

Here is sample test data:

</tr>
          <td>
            <div draggable="true" class="source master draggable-box wysiwyg-block" data-block-type="master" data-block-id="96e80afb-afa0-4e46-bfb7-34b80da76112" style="opacity: 1;">
              <w-block-content data-block-content-id="96e80afb-afa0-4e46-bfb7-34b80da76112"><span class="source-block-tooltip">
                  <div>

Další master<br><div><b>Master č. 2</b>                  </div><br>

                  </div>
                </span></w-block-content>
            </div>
          </td>
        </tr>
</tr>
          <td>
            <div draggable="true" class="source master draggable-box wysiwyg-block" data-block-type="master" data-block-id="96e80afb-afa0-4e46-bfb7-34b80da76112" style="opacity: 1;">
              <w-block-content data-block-content-id="96e80afb-afa0-4e46-bfb7-34b80da76112"><span class="source-block-tooltip">
                  <div>

Další master<br><b>Master č. 2</b><br>
                  
                   </div>
                </span></w-block-content>
            </div>
          </td>
        </tr>

which I've been testing here: https://regex101.com/r/jekZhr/3

The first extracted chunk should be:


Další master<br><div><b>Master č. 2</b>                  </div><br>

                  

I know that regex is not the best tool for handling XML/HTML, but I need to know whether such a regex is possible or whether I need to change the structure of the data.

Radek asked Sep 01 '25 10:09


2 Answers

Pure regex solution that accepts trickier input than the sample data provided in the question.

The code and data snippet at the bottom includes such tricky input. For example, it includes additional (unexpected) non-whitespace within the matching elements that are not part of the extracted data, HTML comments in this case.

🚩 I inferred this as a requirement from the original regex provided in the question.

None of the other answers as of this writing can handle this input.

⚠️ It also accepts some illegal input, but that's what you get by requiring the use of regular expressions and disallowing a true HTML parser.

On the other hand, an HTML parser will make it difficult to handle the malformed HTML in the sample input given in the question. A conforming parser will handle such "tag soup" by forcibly matching the stray closing tag to an open div further up the tree, prematurely closing any intervening parent elements along the way. So not only will it use the first rather than the last </div> within the data record, it may close higher-up container elements and wreak havoc on how the rest of the file is parsed.

The regex

/<w-block-content[^>]*>[\s\S]*?<span[^>]*>[\s\S]*?<div[^>]*>([\s\S]*?)<\/div\s*>(?:(?!<\/div\s*>)[\s\S])*?<\/span\s*>[\s\S]*?<\/w-block-content\s*>/g

The regex meets all the requirements stated in the question:

  • It is pure RegExp. It requires no JavaScript other than the standard code needed to invoke it.
    • It can be invoked in one call via String.prototype.matchAll(), which returns an iterator over all matches
    • Or you can invoke it repeatedly via RegExp.prototype.exec(), which returns successive matches on each call and keeps track of where it left off automatically (through the regex's lastIndex). See test code below.
    • Regex grouping is used so that the entire outer "record" is parsed and consumed but the "data" within is still available separately. Otherwise parsing successive records would require additional Javascript code to set the pointer to the end of the record before the next parse. That would not only go against the requirements but would also result in redundant and inefficient parsing.
      • The full record is available as group 0 of each match
      • The data within is available as group 1 of each match
  • It handles all legal extra whitespace within tags
  • It handles both whitespace and legal non-whitespace between elements (explained above).
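
As a minimal sketch of the one-call route, the regex can be passed to String.prototype.matchAll() and group 1 collected from each match. The two-record input string and the variable names here are invented for illustration; only the regex comes from this answer.

```javascript
const recordRegex = /<w-block-content[^>]*>[\s\S]*?<span[^>]*>[\s\S]*?<div[^>]*>([\s\S]*?)<\/div\s*>(?:(?!<\/div\s*>)[\s\S])*?<\/span\s*>[\s\S]*?<\/w-block-content\s*>/g;

// Hypothetical input: one simple record and one with an inner div.
const sampleHtml =
  '<w-block-content id="a"><span><div>first</div></span></w-block-content>\n' +
  '<w-block-content id="b"><span><div>x<div>y</div>z</div></span></w-block-content>';

// matchAll() returns an iterator; Array.from turns it into an array
// and maps each match to its group 1 (the data) in one step.
const extracted = Array.from(sampleHtml.matchAll(recordRegex), (m) => m[1]);

console.log(extracted); // [ 'first', 'x<div>y</div>z' ]
```

Note that matchAll() throws a TypeError if the regex lacks the g flag, so the flag is not optional here.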

In addition:

  • It works in older browsers, since it does not rely on lookbehind or dotAll
    • Lookbehind assertions have backward compatibility limits. Lookbehind was added in ECMAScript 2018, but not every browser supports it, even among recent versions.
    • dotAll (the s flag) also has backward compatibility limits
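
To illustrate the dotAll point, here is a small sketch (with a made-up three-line string) comparing `.` without the s flag, the `[\s\S]` class used in this answer, and `.` with the s flag:

```javascript
// Hypothetical multi-line input.
const multiline = '<div>\nline\n</div>';

// Without the s flag, "." does not match newlines, so this fails.
const dotNoFlag = /<div>(.*?)<\/div>/.exec(multiline);

// [\s\S] matches any character, newlines included, with no flag needed.
const classAny = /<div>([\s\S]*?)<\/div>/.exec(multiline);

// With the s (dotAll) flag, "." matches newlines too, but the flag
// is only available in ECMAScript 2018 and later engines.
const dotAllFlag = /<div>(.*?)<\/div>/s.exec(multiline);

console.log(dotNoFlag); // null
console.log(JSON.stringify(classAny[1])); // "\nline\n"
console.log(JSON.stringify(dotAllFlag[1])); // "\nline\n"
```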

The regex explained

/
<w-block-content[^>]*> opening w-block-content "record" tag with arbitrary attributes and whitespace
[\s\S]*? arbitrary whitespace and non-whitespace within w-block-content before span
<span[^>]*> expected nested span with arbitrary attributes and whitespace
[\s\S]*? arbitrary whitespace and non-whitespace within span before div
<div[^>]*> expected nested div with arbitrary attributes and whitespace. This div wraps the data.
([\s\S]*?) the data
<\/div\s*> the closing div tag with arbitrary legal whitespace.
(?:(?!<\/div\s*>)[\s\S])*? arbitrary whitespace and non-whitespace within span after div
🌶 except that it guarantees that </div> matched by the preceding pattern is the last one within the span element.
<\/span\s*> the closing span tag with arbitrary legal whitespace.
[\s\S]*? arbitrary whitespace and non-whitespace within w-block-content after span
<\/w-block-content\s*> the closing w-block-content tag with arbitrary legal whitespace.
/g global flag that enables extracting multiple matches from the input. Affects how String.matchAll and RegExp.exec work.
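
The key piece, `(?:(?!<\/div\s*>)[\s\S])*?`, can be tried in isolation. The sketch below uses a deliberately tiny, made-up string and simplified patterns (no attribute or whitespace handling) to contrast a plain lazy match with the lookahead-guarded one:

```javascript
// Hypothetical input: two closing divs before the closing span.
const closingDivs = 'a</div>b</div>c</span>d';

// Naive: the lazy capture ends at the FIRST </div>, because the
// following [\s\S]*? is free to swallow the second </div>.
const naive = /([\s\S]*?)<\/div>[\s\S]*?<\/span>/.exec(closingDivs);

// Guarded: each character between </div> and </span> is consumed only
// if a </div> does not start there, so the engine must backtrack
// until the matched </div> is the LAST one before </span>.
const guarded = /([\s\S]*?)<\/div>(?:(?!<\/div>)[\s\S])*?<\/span>/.exec(closingDivs);

console.log(naive[1]); // a
console.log(guarded[1]); // a</div>b
```

The same shape should carry over to other "last X before Y" problems by substituting the two delimiters, assuming X cannot legitimately appear between the last X and Y.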

Tricky Test Data and Example Usage/Test Code

'use strict'
const input = `<tr>
          <td>
            <div draggable="true" class="source master draggable-box wysiwyg-block" data-block-type="master" data-block-id="96e80afb-afa0-4e46-bfb7-34b80da76112" style="opacity: 1;">
              <w-block-content data-block-content-id="96e80afb-afa0-4e46-bfb7-34b80da76112">
                <span class="source-block-tooltip">
                  <div>SIMPLE CASE DATA STARTS HERE

Další master<br><b>Master č. 2</b><br>

                  SIMPLE CASE DATA ENDS HERE</div>
                </span>
              </w-block-content>
            </div>
          </td>
</tr><tr>
          <td>
            <div draggable="true" class="source master draggable-box wysiwyg-block" data-block-type="master" data-block-id="96e80afb-afa0-4e46-bfb7-34b80da76112" style="opacity: 1;">
              <w-block-content class="tricky" 
                   data-block-content-id="96e80afb-afa0-4e46-bfb7-34b80da76112"  >
                       <!-- TRICKY: whitespace within expected tags above and below,
                        and also this comment inserted between the tags -->
                <span class="source-block-tooltip"
                      color="burgandy"
                      > <!-- TRICKY: some more non-whitespace
                       between expected tags --> 
                  <div
                     >TRICKY CASE DATA STARTS HERE
                     <div> TRICKY inner div

Další master<br><b>Master č. 2</b><br>
                     </div>
                     TRICKY unmatched closing div tags
                     </div> Per the requirements, THIS closing div tag should be ignored and
                     the one below (the last one before the closing outer tags) should be 
                     treated as the closing tag.
                  TRICKY CASE DATA ENDS HERE</div> TRICKY closing tags can have whitespace including newlines
                  <!-- TRICKY more stuff between closing tags -->
                </span
                   >
                <!-- TRICKY more stuff between closing tags -->
              </w-block-content
                 >
            </div>
          </td>
</tr>
`

const regex = /<w-block-content[^>]*>[\s\S]*?<span[^>]*>[\s\S]*?<div[^>]*>([\s\S]*?)<\/div\s*>((?:(?!<\/div\s*>)[\s\S])*?)<\/span\s*>[\s\S]*?<\/w-block-content\s*>/g

function extractNextRecord() {
    const match = regex.exec(input)
    if (match) {
        return {record: match[0], data: match[1]}
    } else {
        return null
    }
}

let output = '', result, count = 0
while (result = extractNextRecord()) {
    count++
    console.log(`-------------------- RECORD ${count} -----------------------\n${result.record}\n---------------------------------------------------\n\n`)    
    output += `<hr><pre>${result.data.replaceAll('<', '&lt;')}</pre>`
}
output += '<hr>'
output = `<p>Extracted ${count} records:</p>` + output
document.documentElement.innerHTML = output
Inigo answered Sep 04 '25 00:09


As already commented, regex isn't a general purpose tool -- in fact it's a specific tool that matches patterns in a string. Having said that here's a regex solution that will match everything after the first <div> up to </w-block-content>. From there find the last index of </div> and .slice() it.

RegExp

/(?<=<w-block-content[\s\S]*?<div[\s\S]*?>)
[\s\S]*?
(?=<\/w-block-content>)/g

Explanation

A lookbehind: (?<=...) must precede the match, but will not be included in the match itself.

A lookahead: (?=...) must follow the match, but will not be included in the match itself.

Segment: (?<=<w-block-content[\s\S]*?<div[\s\S]*?>)
Description: Asserts that the literal "<w-block-content", then anything, then the literal "<div", then anything, then the literal ">" comes before whatever is matched. It is not included in the match.

Segment: [\s\S]*?
Description: Matches anything, as few characters as possible.

Segment: (?=<\/w-block-content>)
Description: Asserts that the literal "</w-block-content>" comes after whatever is matched. It is not included in the match.

Example

const rgx = /(?<=<w-block-content[\s\S]*?<div[\s\S]*?>)[\s\S]*?(?=<\/w-block-content>)/g;

const str = document.querySelector("main").innerHTML;

const A = str.match(rgx)[0];

const idx = A.lastIndexOf("</div>");

const X = A.slice(0, idx);

console.log(X);
<main>
  <w-block-content id="A">
    CONTENT OF #A
    <span id="B">
      CONTENT OF #B
      <div id="C">
        <div>CONTENT OF #C</div>
        <div>CONTENT OF #C</div>
      </div>
      CONTENT OF #B
    </span>
    CONTENT OF #A
  </w-block-content>
</main>
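
The same two-step approach can also be tried outside a browser by feeding it a plain string instead of reading innerHTML. This is only a sketch: the helper name extractUpToLastDiv and the one-line sample are hypothetical, and the null checks are an addition that the example above does not have.

```javascript
const afterFirstDiv = /(?<=<w-block-content[\s\S]*?<div[\s\S]*?>)[\s\S]*?(?=<\/w-block-content>)/g;

// Hypothetical helper: returns the text up to the last </div>,
// or null when the regex or the closing tag is not found.
function extractUpToLastDiv(html) {
  const matches = html.match(afterFirstDiv);
  if (!matches) return null;
  const chunk = matches[0];
  const idx = chunk.lastIndexOf('</div>');
  return idx === -1 ? null : chunk.slice(0, idx);
}

// Hypothetical one-line sample with two inner divs.
const oneLineSample =
  '<w-block-content id="A"><span id="B"><div id="C">' +
  '<div>one</div><div>two</div></div> tail</span></w-block-content>';

console.log(extractUpToLastDiv(oneLineSample)); // <div>one</div><div>two</div>
console.log(extractUpToLastDiv('no tags here')); // null
```

Because lastIndexOf looks for the exact string `</div>`, a closing tag written with internal whitespace (such as `</div >`) would be missed by this step.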
zer00ne answered Sep 04 '25 01:09