In other words, there can be no other occurrence of the pattern between the end of the match and the second pattern. This needs to be implemented in a single regular expression.
In my specific case I have a page of HTML and need to extract all the content between
<w-block-content><span><div>
and
</div></span></w-block-content>
where the content may itself contain nested div elements within the above outer div. But you can assume that each <w-block-content> element has one <span> child (i.e. it may contain other non-span children), that this <span> has one <div> child, and that the data ends at the last </div> within the <span> within the <w-block-content>, even if it is unmatched with an opening <div>. Thus the problem stated in the question at the top.
The following regex successfully matches as long as there are NO internal </div> tags:
(?:<w-block-content.*[\s\S]*?<div>)([\s\S]*?)(?:<\/div>[\s\S]*?<\/span>[\s\S]*?<\/w-block-content>)
❌ But if there are additional </div> tags, the match ends prematurely, not including the entirety of the block.
I use [\s\S]*? to match against arbitrary content, including extra whitespace and newlines.
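A minimal sketch that reproduces the premature end, assuming a condensed one-record version of the sample data (the attribute value "x" is made up for brevity):

```javascript
// The regex from the question, unchanged.
const original = /(?:<w-block-content.*[\s\S]*?<div>)([\s\S]*?)(?:<\/div>[\s\S]*?<\/span>[\s\S]*?<\/w-block-content>)/;

// Condensed record with one nested div inside the outer data div (illustrative only).
const sample = [
  '<w-block-content data-block-content-id="x"><span class="source-block-tooltip">',
  '<div>',
  'Další master<br><div><b>Master č. 2</b> </div><br>',
  '</div>',
  '</span></w-block-content>',
].join('\n');

// The lazy capture group stops at the first </div> it can use, which is the inner one.
const captured = original.exec(sample)[1];
console.log(JSON.stringify(captured));
```

The captured text stops just before the inner </div>, so the trailing `</div><br>` of the data is lost.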
Here is sample test data:
</tr>
<td>
<div draggable="true" class="source master draggable-box wysiwyg-block" data-block-type="master" data-block-id="96e80afb-afa0-4e46-bfb7-34b80da76112" style="opacity: 1;">
<w-block-content data-block-content-id="96e80afb-afa0-4e46-bfb7-34b80da76112"><span class="source-block-tooltip">
<div>
Další master<br><div><b>Master č. 2</b> </div><br>
</div>
</span></w-block-content>
</div>
</td>
</tr>
</tr>
<td>
<div draggable="true" class="source master draggable-box wysiwyg-block" data-block-type="master" data-block-id="96e80afb-afa0-4e46-bfb7-34b80da76112" style="opacity: 1;">
<w-block-content data-block-content-id="96e80afb-afa0-4e46-bfb7-34b80da76112"><span class="source-block-tooltip">
<div>
Další master<br><b>Master č. 2</b><br>
</div>
</span></w-block-content>
</div>
</td>
</tr>
which I've been testing here: https://regex101.com/r/jekZhr/3
The first extracted chunk should be:
Další master<br><div><b>Master č. 2</b> </div><br>
I know that regex is not the best tool for handling XML/HTML, but I need to know whether such a regex is possible or whether I need to change the structure of the data.
The code and data snippet at the bottom includes such tricky input. For example, it includes additional (unexpected) non-whitespace within the matching elements that are not part of the extracted data, HTML comments in this case.
🚩 I inferred this as a requirement from the original regex provided in the question.
None of the other answers as of this writing can handle this input.
⚠️ It also accepts some illegal input, but that's what you get by requiring the use of regular expressions and disallowing a true HTML parser.
On the other hand, an HTML parser will make it difficult to handle the malformed HTML in the sample input given in the question. A conforming parser will handle such "tag soup" by forcibly matching the tag to an open div further up the tree, prematurely closing any intervening parent elements along the way. So not only will it use the first rather than the last </div> within the data record, it may also close higher-up container elements and wreak havoc on how the rest of the file is parsed.
/<w-block-content[^>]*>[\s\S]*?<span[^>]*>[\s\S]*?<div[^>]*>([\s\S]*?)<\/div\s*>(?:(?!<\/div\s*>)[\s\S])*?<\/span\s*>[\s\S]*?<\/w-block-content\s*>/g
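The regex meets all the requirements stated in the question. Multiple records can be extracted either with String.matchAll() (returns an iterator over all matches) or with RegExp.exec(), which returns successive matches on each call, keeping track of where it left off automatically. See test code below.
In addition, it does not depend on the dotall (s) flag, because it uses [\s\S] rather than . to match across newlines.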
Segment | Description |
---|---|
/ | start of the regex literal |
<w-block-content[^>]*> | opening w-block-content "record" tag with arbitrary attributes and whitespace |
[\s\S]*? | arbitrary whitespace and non-whitespace within w-block-content before span |
<span[^>]*> | expected nested span with arbitrary attributes and whitespace |
[\s\S]*? | arbitrary whitespace and non-whitespace within span before div |
<div[^>]*> | expected nested div with arbitrary attributes and whitespace. This div wraps the data. |
([\s\S]*?) | the data |
<\/div\s*> | the closing div tag with arbitrary legal whitespace. |
(?:(?!<\/div\s*>)[\s\S])*? | arbitrary whitespace and non-whitespace within span after div 🌶 except that it guarantees that </div> matched by the preceding pattern is the last one within the span element. |
<\/span\s*> | the closing span tag with arbitrary legal whitespace. |
[\s\S]*? | arbitrary whitespace and non-whitespace within w-block-content after span |
<\/w-block-content\s*> | the closing w-block-content tag with arbitrary legal whitespace. |
/g | global flag that enables extracting multiple matches from the input. Affects how String.matchAll and RegExp.exec work. |
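To isolate what the negative-lookahead segment contributes, here is a small sketch that runs the div/span part of the pattern with and without it. The one-line input and the names naive and guarded are made up for illustration and do not come from the question.

```javascript
// Made-up input: the data contains a nested div and a stray, unmatched closing div.
const html = '<span><div>a<div>b</div>c</div>d</div> <!-- note --></span>';

// Without the lookahead, the lazy capture ends at the first </div>.
const naive = /<div[^>]*>([\s\S]*?)<\/div\s*>[\s\S]*?<\/span\s*>/;

// With the lookahead, no further </div> may start between the chosen </div> and </span>.
const guarded = /<div[^>]*>([\s\S]*?)<\/div\s*>(?:(?!<\/div\s*>)[\s\S])*?<\/span\s*>/;

const naiveData = naive.exec(html)[1];
const guardedData = guarded.exec(html)[1];
console.log(naiveData);   // a<div>b
console.log(guardedData); // a<div>b</div>c</div>d
```

The guarded pattern forces the regex engine to backtrack until the captured data ends at the last </div> before </span>, at the cost of extra backtracking on large inputs.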
'use strict'
const input = `<tr>
<td>
<div draggable="true" class="source master draggable-box wysiwyg-block" data-block-type="master" data-block-id="96e80afb-afa0-4e46-bfb7-34b80da76112" style="opacity: 1;">
<w-block-content data-block-content-id="96e80afb-afa0-4e46-bfb7-34b80da76112">
<span class="source-block-tooltip">
<div>SIMPLE CASE DATA STARTS HERE
Další master<br><b>Master č. 2</b><br>
SIMPLE CASE DATA ENDS HERE</div>
</span>
</w-block-content>
</div>
</td>
</tr><tr>
<td>
<div draggable="true" class="source master draggable-box wysiwyg-block" data-block-type="master" data-block-id="96e80afb-afa0-4e46-bfb7-34b80da76112" style="opacity: 1;">
<w-block-content class="tricky"
data-block-content-id="96e80afb-afa0-4e46-bfb7-34b80da76112" >
<!-- TRICKY: whitespace within expected tags above and below,
and also this comment inserted between the tags -->
<span class="source-block-tooltip"
color="burgandy"
> <!-- TRICKY: some more non-whitespace
between expected tags -->
<div
>TRICKY CASE DATA STARTS HERE
<div> TRICKY inner div
Další master<br><b>Master č. 2</b><br>
</div>
TRICKY unmatched closing div tags
</div> Per the requirements, THIS closing div tag should be ignored and
the one below (the last one before the closing outer tags) should be
treated as the closing tag.
TRICKY CASE DATA ENDS HERE</div> TRICKY closing tags can have whitespace including newlines
<!-- TRICKY more stuff between closing tags -->
</span
>
<!-- TRICKY more stuff between closing tags -->
</w-block-content
>
</div>
</td>
</tr>
`
// Group 1 captures the data. Group 2 captures whatever follows the chosen </div> inside the span.
const regex = /<w-block-content[^>]*>[\s\S]*?<span[^>]*>[\s\S]*?<div[^>]*>([\s\S]*?)<\/div\s*>((?:(?!<\/div\s*>)[\s\S])*?)<\/span\s*>[\s\S]*?<\/w-block-content\s*>/g

// Returns the next record, or null when there are no more matches.
// With the g flag, RegExp.exec() resumes from regex.lastIndex on each call.
function extractNextRecord() {
  const match = regex.exec(input)
  if (match) {
    return {record: match[0], data: match[1]}
  } else {
    return null
  }
}

let output = '', result, count = 0
while ((result = extractNextRecord())) {
  count++
  console.log(`-------------------- RECORD ${count} -----------------------\n${result.record}\n---------------------------------------------------\n\n`)
  // Escape "<" so the extracted markup is displayed as text instead of being rendered.
  output += `<hr><pre>${result.data.replaceAll('<', '&lt;')}</pre>`
}
output += '<hr>'
output = `<p>Extracted ${count} records:</p>` + output
document.documentElement.innerHTML = output
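The String.matchAll() route mentioned earlier could look roughly like the sketch below. It uses the same pattern without the extra capture group, and a made-up two-record input that is much simpler than the real data.

```javascript
const regex = /<w-block-content[^>]*>[\s\S]*?<span[^>]*>[\s\S]*?<div[^>]*>([\s\S]*?)<\/div\s*>(?:(?!<\/div\s*>)[\s\S])*?<\/span\s*>[\s\S]*?<\/w-block-content\s*>/g;

// Made-up input with two small records; the second has a nested div.
const input = `
<w-block-content id="1"><span><div>first</div></span></w-block-content>
<w-block-content id="2"><span><div>second <div>nested</div></div></span></w-block-content>`;

// matchAll() requires the g flag and returns an iterator, so spread it into an array
// and keep only capture group 1 (the data) of each match.
const records = [...input.matchAll(regex)].map((m) => m[1]);
console.log(records);
```

Unlike the exec() loop, this does not depend on the regex object's lastIndex state, so the same regex can be reused on another string without resetting it.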
As already commented, regex isn't a general purpose tool -- in fact it's a specific tool that matches patterns in a string. Having said that, here's a regex solution that will match everything after the first <div> up to </w-block-content>. From there, find the last index of </div> and .slice() the string at that index.
RegExp
/(?<=<w-block-content[\s\S]*?<div[\s\S]*?>)
[\s\S]*?
(?=<\/w-block-content>)/g
Explanation
A look behind: (?<=...) must precede the match, but will not be included in the match itself.
A look ahead: (?=...) must follow the match, but will not be included in the match itself.
Segment | Description |
---|---|
(?<=<w-block-content[\s\S]*?<div[\s\S]*?>) | Asserts that the literal "<w-block-content", then anything, then the literal "<div", then anything, then the literal ">" comes immediately before whatever is matched. Not included in the match. |
[\s\S]*? | Match anything, as few characters as possible |
(?=<\/w-block-content>) | Asserts that the literal "</w-block-content>" comes immediately after whatever is matched. Not included in the match. |
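As a smaller illustration of how look behind and look ahead leave the delimiters out of the match, here is a sketch on a made-up string (not related to the HTML in the question):

```javascript
// Made-up string: extract the text between "[" and "]" without the brackets.
const text = 'id=[abc] other=[def]';

// (?<=\[) requires "[" just before the match, (?=\]) requires "]" just after it.
const inside = text.match(/(?<=\[)[^\]]*(?=\])/g);
console.log(inside); // [ 'abc', 'def' ]
```

Neither bracket appears in the results, because the assertions only check the surrounding text and consume nothing.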
Example
const rgx = /(?<=<w-block-content[\s\S]*?<div[\s\S]*?>)[\s\S]*?(?=<\/w-block-content>)/g;
const str = document.querySelector("main").innerHTML;
const A = str.match(rgx)[0];
const idx = A.lastIndexOf("</div>");
const X = A.slice(0, idx);
console.log(X);
<main>
<w-block-content id="A">
CONTENT OF #A
<span id="B">
CONTENT OF #B
<div id="C">
<div>CONTENT OF #C</div>
<div>CONTENT OF #C</div>
</div>
CONTENT OF #B
</span>
CONTENT OF #A
</w-block-content>
</main>
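When the page holds several records, the look-behind version seems harder to reuse, because the look behind also appears to be satisfied right after a closing </w-block-content>, so a later match can include the next record's opening tags. One possible variation on the same "last index of </div>, then slice" idea is to cut the input into records first. This is only a sketch: extractAll is a hypothetical helper, the input is made up, and it assumes closing div tags are written exactly as </div> and that no </div> appears after </span> inside a record.

```javascript
// Hypothetical helper: returns the data of every <w-block-content> record in html.
function extractAll(html) {
  // Step 1: cut the input into records, one per <w-block-content> element.
  const records = html.match(/<w-block-content[\s\S]*?<\/w-block-content\s*>/g) ?? [];
  const results = [];
  for (const record of records) {
    const open = /<div[^>]*>/.exec(record);      // first opening div in the record
    const close = record.lastIndexOf('</div>');  // last closing div in the record
    if (open === null) continue;                 // no data div: skip the record
    const start = open.index + open[0].length;
    if (close < start) continue;                 // no closing div after the opening one
    // Step 2: the data is everything between those two positions.
    results.push(record.slice(start, close));
  }
  return results;
}

// Made-up input with two records; the first has nested divs.
const html = `
<w-block-content id="A"><span><div id="C"><div>one</div><div>two</div></div></span></w-block-content>
<w-block-content id="B"><span><div>three</div></span></w-block-content>`;

const data = extractAll(html);
console.log(data);
```

Splitting first keeps each search confined to one record, so a nested or stray </div> in one record cannot affect the next.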