Regex match up to the LAST occurrence of a pattern (e.g. </div>) BEFORE another matching pattern (e.g. </div-container>)

Question

In other words, there can be no other occurrence of the pattern between the end of the match and the second pattern. This needs to be implemented in a single regular expression.

In my specific case I have a page of HTML and need to extract all the content between

<w-block-content><span><div>

and

</div></span></w-block-content>

where

the elements might have attributes
the HTML might be formatted or not - there might be extra white space and newlines
there may be other content between any of the above tags, including inner div elements within the above outer div. But you can assume each <w-block-content> element
- contains ONLY ONE direct <span> child (though it may contain other non-span children)
  - which contains ONLY ONE direct <div> child
    - which wraps the content that must be extracted
🚩 the match must extend all the way to the last </div> within the <span> within the <w-block-content>, even if it is unmatched with an opening <div>.
the solution must be pure ECMAScript-spec Regex. No Javascript code can be used

Thus the problem stated in the question at the top.

The following regex successfully matches as long as there are NO internal </div> tags:

(?:<w-block-content.*[\s\S]*?<div>)([\s\S]*?)(?:<\/div>[\s\S]*?<\/span>[\s\S]*?<\/w-block-content>)

❌ But if there are additional </div> tags, the match ends prematurely, not including the entirety of the block.

I use [\s\S]*? to match against arbitrary content, including extra whitespace and newlines.
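To make the failure concrete, here is a minimal sketch that runs the regex above against a small, made-up input (the `html` string is an assumption for illustration, not the real page). Because the capture group is lazy, it stops at the first `</div>` from which the rest of the pattern can still match:

```javascript
// The regex from the question, unchanged.
const original = /(?:<w-block-content.*[\s\S]*?<div>)([\s\S]*?)(?:<\/div>[\s\S]*?<\/span>[\s\S]*?<\/w-block-content>)/;

// Hypothetical multi-line input with one inner <div>...</div> pair.
const html = [
  '<w-block-content><span>',
  '<div>',
  'A<div>B</div>C',
  '</div>',
  '</span></w-block-content>',
].join('\n');

// Group 1 ends at the inner </div>, so "</div>C" is lost.
const captured = original.exec(html)[1];
console.log(JSON.stringify(captured)); // "\nA<div>B"
```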

Here is sample test data:

</tr>
          <td>
            <div draggable="true" class="source master draggable-box wysiwyg-block" data-block-type="master" data-block-id="96e80afb-afa0-4e46-bfb7-34b80da76112" style="opacity: 1;">
              <w-block-content data-block-content-id="96e80afb-afa0-4e46-bfb7-34b80da76112"><span class="source-block-tooltip">
                  <div>

Další master<br><div><b>Master č. 2</b>                  </div><br>

                  </div>
                </span></w-block-content>
            </div>
          </td>
        </tr>
</tr>
          <td>
            <div draggable="true" class="source master draggable-box wysiwyg-block" data-block-type="master" data-block-id="96e80afb-afa0-4e46-bfb7-34b80da76112" style="opacity: 1;">
              <w-block-content data-block-content-id="96e80afb-afa0-4e46-bfb7-34b80da76112"><span class="source-block-tooltip">
                  <div>

Další master<br><b>Master č. 2</b><br>
                  
                   </div>
                </span></w-block-content>
            </div>
          </td>
        </tr>

which I've been testing here: https://regex101.com/r/jekZhr/3

The first extracted chunk should be:


Další master<br><div><b>Master č. 2</b>                  </div><br>

I know that regex is not the best tool for handling XML/HTML but I need to know if such regex is possible or if I need to change the structure of data.

Inigo · Accepted Answer

Pure regex solution that accepts trickier input than the sample data provided in the question.

The code and data snippet at the bottom includes such tricky input. For example, it includes additional (unexpected) non-whitespace within the matching elements that are not part of the extracted data, HTML comments in this case.

🚩 I inferred this as a requirement from the original regex provided in the question.

None of the other answers as of this writing can handle this input.

⚠️ It also accepts some illegal input, but that's what you get by requiring the use of regular expressions and disallowing a true HTML parser.

On the other hand, an HTML parser will make it difficult to handle the malformed HTML in the sample input given in the question. A conforming parser will handle such "tag soup" by forcibly matching the stray closing tag to an open div further up the tree, prematurely closing any intervening parent elements along the way. So not only will it use the first rather than the last </div> within the data record, it may close higher-up container elements and wreak havoc on how the rest of the file is parsed.

The regex

/<w-block-content[^>]*>[\s\S]*?<span[^>]*>[\s\S]*?<div[^>]*>([\s\S]*?)<\/div\s*>(?:(?!<\/div\s*>)[\s\S])*?<\/span\s*>[\s\S]*?<\/w-block-content\s*>/g

The regex meets all the requirements stated in the question:

It is pure Regexp. It requires no Javascript other than the standard code needed to invoke it.
- It can be invoked in one call via String.prototype.matchAll() (returns an iterator of matches)
- Or you can invoke it repeatedly via RegExp.prototype.exec(), which returns the next match on each call and keeps track of where it left off automatically (via lastIndex). See test code below.
- Regex grouping is used so that the entire outer "record" is parsed and consumed but the "data" within is still available separately. Otherwise parsing successive records would require additional Javascript code to set the pointer to the end of the record before the next parse. That would not only go against the requirements but would also result in redundant and inefficient parsing.
  - The full record is available as group 0 of each match
  - The data within is available as group 1 of each match
It handles all legal extra whitespace within tags
It handles both whitespace and legal non-whitespace between elements (explained above).
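As a rough sketch of the one-call option, the snippet below collects group 1 of every match with `matchAll`. The `sample` string is a made-up, shortened input (an assumption, not the question's data), and the forward slashes are escaped so the pattern is a valid regex literal:

```javascript
const regex = /<w-block-content[^>]*>[\s\S]*?<span[^>]*>[\s\S]*?<div[^>]*>([\s\S]*?)<\/div\s*>(?:(?!<\/div\s*>)[\s\S])*?<\/span\s*>[\s\S]*?<\/w-block-content\s*>/g;

// Hypothetical input: two records, the second with an inner <div>.
const sample = `
<w-block-content id="one"><span><div>first</div></span></w-block-content>
<w-block-content id="two"><span>
  <div>second <div>inner</div> tail</div>
</span></w-block-content>`;

// matchAll returns an iterator; Array.from turns it into an array
// and the mapper keeps only the captured data (group 1).
const data = Array.from(sample.matchAll(regex), (m) => m[1]);
console.log(data); // [ 'first', 'second <div>inner</div> tail' ]
```

Note that `matchAll` throws a TypeError if the regex lacks the `g` flag.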

In addition:

It works in older browsers, not relying on lookbehind or dotall
- Lookbehind assertions have backward-compatibility limits. Lookbehind was added in ECMAScript 2018, and not even all of the latest browsers support it.
- dotall also has backward compatibility limits
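A small sketch of what avoiding these features looks like in practice. The helper name `supportsLookbehind` is hypothetical, introduced only for illustration; its result depends on the engine running it:

```javascript
const text = 'a\nb';

// Without the dotall (s) flag, "." does not match a newline...
const dotMatches = /a.b/.test(text);            // false
// ...but the class [\s\S] matches any character, newline included.
const classMatches = /a[\s\S]b/.test(text);     // true

// Hypothetical helper: engines without lookbehind throw a SyntaxError
// when asked to compile one, so construct it inside try/catch.
function supportsLookbehind() {
  try {
    new RegExp('(?<=a)b');
    return true;
  } catch (e) {
    return false;
  }
}

console.log(dotMatches, classMatches, supportsLookbehind());
```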

The regex explained

`/`
`<w-block-content[^>]*>`	opening `w-block-content` "record" tag with arbitrary attributes and whitespace
`[\s\S]*?`	arbitrary whitespace and non-whitespace within `w-block-content` before `span`
`<span[^>]*>`	expected nested `span` with arbitrary attributes and whitespace
`[\s\S]*?`	arbitrary whitespace and non-whitespace within `span` before `div`
`<div[^>]*>`	expected nested `div` with arbitrary attributes and whitespace. This `div` wraps the data.
`([\s\S]*?)`	the data
`<\/div\s*>`	the closing `div` tag with arbitrary legal whitespace.
`(?:(?!<\/div\s*>)[\s\S])*?`	arbitrary whitespace and non-whitespace within `span` after `div` 🌶 except that it guarantees that the `</div>` matched by the preceding pattern is the last one within the `span` element.
`<\/span\s*>`	the closing `span` tag with arbitrary legal whitespace.
`[\s\S]*?`	arbitrary whitespace and non-whitespace within `w-block-content` after `span`
`<\/w-block-content\s*>`	the closing `w-block-content` tag with arbitrary legal whitespace.
`/g`	`global` flag that enables extracting multiple matches from the input. Affects how `String.matchAll` and `RegExp.exec` work.
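The key step is the negative lookahead that is re-checked before every character. Here is a stripped-down sketch of just that idea, using made-up `<a>`/`</b>` markers instead of the real tags (both patterns and the input are assumptions for illustration):

```javascript
const text = '<a>x</b>y</b>z</a>';

// Naive version: the lazy group stops at the FIRST "</b>".
const naive = /<a>([\s\S]*?)<\/b>[\s\S]*?<\/a>/;

// Guarded version: after "</b>", each following character is consumed
// only if it does not start another "</b>", so the "</b>" that closes
// the group must be the LAST one before "</a>".
const guarded = /<a>([\s\S]*?)<\/b>(?:(?!<\/b>)[\s\S])*?<\/a>/;

const naiveData = naive.exec(text)[1];
const guardedData = guarded.exec(text)[1];
console.log(naiveData);   // x
console.log(guardedData); // x</b>y
```

With the guard in place, the first attempt (group = `x`) fails because the tail cannot cross the second `</b>`, so the engine backtracks and extends the group until the guard can reach `</a>`.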

Tricky Test Data and Example Usage/Test Code

'use strict'
const input = `<tr>
          <td>
            <div draggable="true" class="source master draggable-box wysiwyg-block" data-block-type="master" data-block-id="96e80afb-afa0-4e46-bfb7-34b80da76112" style="opacity: 1;">
              <w-block-content data-block-content-id="96e80afb-afa0-4e46-bfb7-34b80da76112">
                <span class="source-block-tooltip">
                  <div>SIMPLE CASE DATA STARTS HERE

Další master<br><b>Master č. 2</b><br>

                  SIMPLE CASE DATA ENDS HERE</div>
                </span>
              </w-block-content>
            </div>
          </td>
</tr><tr>
          <td>
            <div draggable="true" class="source master draggable-box wysiwyg-block" data-block-type="master" data-block-id="96e80afb-afa0-4e46-bfb7-34b80da76112" style="opacity: 1;">
              <w-block-content class="tricky" 
                   data-block-content-id="96e80afb-afa0-4e46-bfb7-34b80da76112"  >
                       <!-- TRICKY: whitespace within expected tags above and below,
                        and also this comment inserted between the tags -->
                <span class="source-block-tooltip"
                      color="burgandy"
                      > <!-- TRICKY: some more non-whitespace
                       between expected tags --> 
                  <div
                     >TRICKY CASE DATA STARTS HERE
                     <div> TRICKY inner div

Další master<br><b>Master č. 2</b><br>
                     </div>
                     TRICKY unmatched closing div tags
                     </div> Per the requirements, THIS closing div tag should be ignored and
                     the one below (the last one before the closing outer tags) should be 
                     treated as the closing tag.
                  TRICKY CASE DATA ENDS HERE</div> TRICKY closing tags can have whitespace including newlines
                  <!-- TRICKY more stuff between closing tags -->
                </span
                   >
                <!-- TRICKY more stuff between closing tags -->
              </w-block-content
                 >
            </div>
          </td>
</tr>
`

const regex = /<w-block-content[^>]*>[\s\S]*?<span[^>]*>[\s\S]*?<div[^>]*>([\s\S]*?)<\/div\s*>((?:(?!<\/div\s*>)[\s\S])*?)<\/span\s*>[\s\S]*?<\/w-block-content\s*>/g

function extractNextRecord() {
    const match = regex.exec(input)
    if (match) {
        return {record: match[0], data: match[1]}
    } else {
        return null
    }
}

let output = '', result, count = 0
while (result = extractNextRecord()) {
    count++
    console.log(`-------------------- RECORD ${count} -----------------------
${result.record}
---------------------------------------------------

`)    
    output += `<hr><pre>${result.data.replaceAll('<', '&lt;')}</pre>`
}
output += '<hr>'
output = `<p>Extracted ${count} records:</p>` + output
document.documentElement.innerHTML = output

zer00ne · Answer

As already commented, regex isn't a general purpose tool -- in fact it's a specific tool that matches patterns in a string. Having said that here's a regex solution that will match everything after the first <div> up to </w-block-content>. From there find the last index of </div> and .slice() it.

RegExp

/(?<=<w-block-content[\s\S]*?<div[\s\S]*?>)
[\s\S]*?
(?=<\/w-block-content>)/g

regex101

Explanation

A look behind: (?<=...) must precede the match, but will not be included in the match itself.

A look ahead: (?=...) must follow the match, but will not be included in the match itself.

Segment	Description
(?<=<w-block-content[\s\S]*?<div[\s\S]*?>)	Find if literal "`<w-block-content`", then anything, then literal "`<div`", then anything, then literal "`>`" is before whatever is matched. Do not include it in the match.
[\s\S]*?	Match anything
(?=<\/w-block-content>)	Find if literal "`</w-block-content>`" is after whatever is matched. Do not include it in the match.

Example

const rgx = /(?<=<w-block-content[\s\S]*?<div[\s\S]*?>)[\s\S]*?(?=<\/w-block-content>)/g;

const str = document.querySelector("main").innerHTML;

const A = str.match(rgx)[0];

const idx = A.lastIndexOf("</div>");

const X = A.slice(0, idx);

console.log(X);

<main>
  <w-block-content id="A">
    CONTENT OF #A
    <span id="B">
      CONTENT OF #B
      <div id="C">
        <div>CONTENT OF #C</div>
        <div>CONTENT OF #C</div>
      </div>
      CONTENT OF #B
    </span>
    CONTENT OF #A
  </w-block-content>
</main>
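The same match-then-slice idea can be sketched without a DOM and for several records at once. This variant is an assumption, not the code above: it swaps the lookbehind and lookahead for a plain capture group, and `extractAll` and the `html` string are hypothetical names and data for illustration.

```javascript
// Capture everything between the first "<div ...>" and "</w-block-content>".
const rgx = /<w-block-content[\s\S]*?<div[\s\S]*?>([\s\S]*?)<\/w-block-content>/g;

function extractAll(html) {
  const results = [];
  for (const m of html.matchAll(rgx)) {
    const chunk = m[1];
    // Cut at the last literal "</div>" inside the captured chunk.
    const idx = chunk.lastIndexOf('</div>');
    if (idx !== -1) results.push(chunk.slice(0, idx));
  }
  return results;
}

const html = `
<w-block-content id="A"><span id="B"><div id="C"><div>one</div><div>two</div></div></span></w-block-content>
<w-block-content><span><div>three</div></span></w-block-content>`;

const extracted = extractAll(html);
console.log(extracted); // [ '<div>one</div><div>two</div>', 'three' ]
```

Like the example above, this relies on JavaScript beyond the regex itself, and `lastIndexOf('</div>')` will miss closing tags written with internal whitespace such as `</div >`.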
