I am using PHP and a regex to find unclosed HTML tags in a string.
This is my string:
$s="<div><h2>Hello world<h2><p>It's 7Am where I live<p><div>";
As you can see, none of the tags here are closed.
I want to find all unclosed tags, but the problem is that my regex also matches the opening tags.
Here is my regex so far:
/<[^>]+>/i
And this is my preg_match_all() call:
preg_match_all("/<[^>]+>/i",$s,$v);
print_r($v);
What do I need to change in my regex to match only the unclosed tags?
<h2>
<p>
<div>
You might be unaware of this, but DOMDocument can help you fix the HTML.
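$html = "<div><h2>Hello world<h2><p>It's 7Am where I live<p><div>";
// Collect parser warnings about the malformed markup instead of printing them
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML('<root>' . $html . '</root>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
// Remove every element that has no child nodes at all
foreach ($xpath->query('//*[not(node())]') as $node) {
    $node->parentNode->removeChild($node);
}
// Strip the leading "<root>" (6 chars) and the trailing "</root>\n" (8 chars)
echo substr($dom->saveHTML(), 6, -8);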
See IDEONE demo
Result: <div><h2>Hello world</h2><p>It's 7Am where I live</p></div>
Note that the XPath-based empty node cleanup is necessary because the DOM contains empty <h2></h2>, <p></p> and <div></div> elements after the HTML is loaded.
The <root> wrapper is added so that the fragment has a single root element; it is stripped from the output afterwards with substr.
The LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD flags are necessary so that no DOCTYPE and no implied <html>/<body> wrappers are added to the DOM.
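To see what the two flags change, here is a small sketch that loads the same fragment with and without them. The `<p>Hi</p>` fragment and the variable names are made up for illustration, and it assumes a libxml version recent enough to support LIBXML_HTML_NOIMPLIED.

```php
<?php
// Illustrative only: this fragment is an assumption, not taken from the answer.
$fragment = '<p>Hi</p>';

// Keep any parser warnings out of the output.
libxml_use_internal_errors(true);

// Default behaviour: a DOCTYPE and implied <html><body> wrappers are added.
$plain = new DOMDocument();
$plain->loadHTML($fragment);
$withWrappers = $plain->saveHTML();

// With both flags: the fragment is serialized without those additions.
$bare = new DOMDocument();
$bare->loadHTML($fragment, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$withoutWrappers = $bare->saveHTML();

echo $withWrappers, "\n", $withoutWrappers;
```

The first string should start with a DOCTYPE declaration and wrap the paragraph in <html><body>, while the second should contain only the paragraph itself.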
Finding unmatched tags is fundamentally too hard to do with a single regex, because the pattern would have to keep track of nesting. You basically need to push each opening tag you see onto a stack and pop it off the stack when you see the matching closing tag.
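The push-and-pop idea described above can be sketched roughly as follows. findUnclosedTags() is a hypothetical helper, not something from the original answers, and it assumes simple input: no comments, no script blocks, and no ">" inside attribute values. A regex is used only to tokenize the tags; the matching itself is done with a stack.

```php
<?php
// Hypothetical helper: returns the names of tags that were opened but never closed.
function findUnclosedTags(string $html): array
{
    // Void elements never take a closing tag, so they are skipped.
    $void = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
             'link', 'meta', 'source', 'track', 'wbr'];
    $stack = [];     // currently open tags, innermost last
    $unclosed = [];  // tags found to be unclosed when an outer tag closed

    // Capture an optional leading slash, the tag name, and the rest of the tag.
    preg_match_all('~<(/?)([a-z][a-z0-9]*)\b([^>]*)>~i', $html, $matches, PREG_SET_ORDER);

    foreach ($matches as [, $slash, $name, $rest]) {
        $name = strtolower($name);
        if (in_array($name, $void, true) || substr($rest, -1) === '/') {
            continue; // void or self-closing tag
        }
        if ($slash === '') {
            $stack[] = $name; // opening tag: push
            continue;
        }
        // Closing tag: look for the nearest matching opener on the stack.
        $pos = array_search($name, array_reverse($stack, true), true);
        if ($pos !== false) {
            // Everything opened after that opener was never closed.
            $unclosed = array_merge($unclosed, array_splice($stack, $pos + 1));
            array_pop($stack); // remove the matched opener itself
        }
        // A closing tag with no opener is ignored in this sketch.
    }

    // Whatever is still on the stack at the end was never closed either.
    return array_merge($unclosed, $stack);
}

print_r(findUnclosedTags('<div><p>Hello</div>'));
```

For the string in the question, every tag is an opening tag, so this sketch reports all six of them (div, h2, h2, p, p, div) as unclosed. It is not a substitute for a real HTML parser.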
I recommend using a library that does HTML validation. See these questions:
Remove unmatched HTML tags in a string
How to find the unclosed div tag
PHP get all unclosed HTML tags in string