Match unclosed HTML tags using regex and PHP

Tags:

regex

php

I am using PHP and a regex to find unclosed HTML tags in a string.

This is my string:

$s="<div><h2>Hello world<h2><p>It's 7Am where I live<p><div>";

As you can see, none of the tags here are closed.

I want to find all unclosed tags, but the problem is that my regex also matches the opening tags.

Here is my regex so far:

/<[^>]+>/i

And this is my preg_match_all() call:

preg_match_all("/<[^>]+>/i",$s,$v);

print_r($v);

What do I need to change in my regex to match only the unclosed tags?

 <h2>
 <p>
 <div>
Asked by Amit Verma on Nov 24 '15 at 20:11


2 Answers

You might be unaware of this, but DOMDocument can help you fix the HTML.

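$html = "<div><h2>Hello world<h2><p>It's 7Am where I live<p><div>";
// Collect parser warnings about the malformed markup instead of printing them
libxml_use_internal_errors(true);

$dom = new DOMDocument();
// Wrap the fragment in a temporary <root> element; the flags suppress the implied <html>/<body> and the default DOCTYPE
$dom->loadHTML('<root>' . $html . '</root>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);

// Remove every element that has no child nodes (the empty tags the parser created)
foreach( $xpath->query('//*[not(node())]') as $node ) {
    $node->parentNode->removeChild($node);
}
// Strip the leading "<root>" (6 characters) and the trailing "</root>\n" (8 characters)
echo substr($dom->saveHTML(), 6, -8);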

See IDEONE demo

Result: <div><h2>Hello world</h2><p>It's 7Am where I live</p></div>

Note that the XPath-based empty node cleanup is necessary as the DOM contains empty <h2></h2>, <p></p> and <div></div> tags after loading HTML into DOM.

The <root> element is added at the beginning so that the fragment has a single, known root element. Afterwards, the wrapper is stripped from the output with substr.

The LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD flags are necessary so that no default DOCTYPE and no implied <html> and <body> elements are added to the DOM.

Answered by Wiktor Stribiżew on Oct 08 '22 at 13:10


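Finding unmatched tags is fundamentally too hard to do with a single regex, because the pattern would have to remember which tags are still open. You basically need to push each opening tag you see onto a stack and then pop it off the stack when you see the matching closing tag. Whatever is left on the stack at the end was never closed.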
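The push-and-pop bookkeeping described above can be sketched in plain PHP, using an array as the stack and a regex only to split the string into tags. This is a minimal illustration, not a validator: the function name findUnclosedTags is hypothetical, the list of void elements is an assumption about which tags never take a closing tag, and tags inside comments, scripts or attribute values are not handled.

```php
<?php
// Hypothetical helper (not part of PHP): returns the names of tags that were
// opened but never closed, in the order in which they were opened.
function findUnclosedTags(string $html): array
{
    // Assumed list of void elements; these never take a closing tag.
    $void = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
             'link', 'meta', 'source', 'track', 'wbr'];

    // Tokenize only. Group 1 is "/" for a closing tag, group 2 is the tag
    // name, group 3 is "/" for a self-closing tag such as <br/>.
    preg_match_all('~<(/?)([a-z][a-z0-9]*)\b[^>]*?(/?)>~i', $html, $tags, PREG_SET_ORDER);

    $stack = [];
    foreach ($tags as [, $closing, $name, $selfClosing]) {
        $name = strtolower($name);
        if ($selfClosing === '/' || in_array($name, $void, true)) {
            continue; // nothing to close later
        }
        if ($closing === '') {
            $stack[] = $name; // opening tag: push
            continue;
        }
        // Closing tag: remove the most recently opened tag with the same name.
        // A closing tag with no matching opener is ignored.
        for ($i = count($stack) - 1; $i >= 0; $i--) {
            if ($stack[$i] === $name) {
                array_splice($stack, $i, 1);
                break;
            }
        }
    }
    return $stack; // anything still here was never closed
}

$s = "<div><h2>Hello world<h2><p>It's 7Am where I live<p><div>";
echo implode(', ', findUnclosedTags($s)), "\n"; // prints "div, h2, h2, p, p, div"
```

For the string from the question, all six tags are reported, because every one of them is an opening tag. The sketch cannot guess that the second <h2> was meant to be </h2>, which is one reason a real HTML parser is the safer choice.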

I recommend using a library that does HTML validation. See these questions:

Remove unmatched HTML tags in a string

How to find the unclosed div tag

PHP get all unclosed HTML tags in string

Answered by Cargo23 on Oct 08 '22 at 15:10