From a string that contains a lot of HTML, how can I extract all the text from <h1>, <h2>, etc. tags into a new variable?
I would like to capture all of the text from these elements and store it in a new variable as comma-delimited values.
Is it possible using preg_match_all()?
First you need to clean up the HTML ($html_str in the example) with tidy:
$tidy_config = array(
    "indent" => true,
    "output-xml" => true,
    "output-xhtml" => false,
    "drop-empty-paras" => false,
    "hide-comments" => true,
    "numeric-entities" => true,
    "doctype" => "omit",
    "char-encoding" => "utf8",
    "repeated-attributes" => "keep-last"
);
$xml_str = tidy_repair_string($html_str, $tidy_config);
Then you can load the XML ($xml_str) into a DOMDocument:
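// loadXML() is an instance method; calling it statically fails on current PHP versions
$doc = new DOMDocument();
$doc->loadXML($xml_str);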
And finally you can use Horia Dragomir's method:
$list = $doc->getElementsByTagName("h1");
for ($i = 0; $i < $list->length; $i++) {
    print($list->item($i)->nodeValue . "<br/>\n");
}
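To get the comma-delimited value the question asks for, the same loop can be repeated for each heading level and the results joined. This is only a minimal sketch: the sample $xml_str stands in for tidy's output, and the names $headings and $csv are made up for illustration. Note that the result is grouped by tag name, not by position in the document.

```php
<?php
// Stand-in for the tidy output: a small, well-formed XML string (sample data).
$xml_str = '<html><body><h1>First</h1><p>text</p><h2>Second</h2><h1>Third</h1></body></html>';

$doc = new DOMDocument();
$doc->loadXML($xml_str);

// Collect the text of every h1..h6 element, one tag name at a time.
$headings = array();
foreach (array('h1', 'h2', 'h3', 'h4', 'h5', 'h6') as $tag) {
    foreach ($doc->getElementsByTagName($tag) as $node) {
        $headings[] = trim($node->nodeValue);
    }
}

// Join into a single comma-delimited string.
$csv = implode(',', $headings);
print($csv . "\n");
```

If any heading text can itself contain a comma, a different separator (or keeping the array) is probably safer.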
Or you could use XPath for more complex queries on the DOMDocument (see http://www.php.net/manual/en/class.domxpath.php):
$xpath = new DOMXPath($doc);
$list = $xpath->evaluate("//h1");
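As a sketch of what a more complex query could look like, a single XPath expression can select all six heading levels at once, which also keeps the results in document order. The sample $xml_str again stands in for tidy's output, and $headings and $csv are illustrative names only.

```php
<?php
// Stand-in for the tidy output: a small, well-formed XML string (sample data).
$xml_str = '<html><body><h1>First</h1><p>text</p><h2 class="x">Second</h2><h1>Third</h1></body></html>';

$doc = new DOMDocument();
$doc->loadXML($xml_str);

$xpath = new DOMXPath($doc);

// One location path with a predicate: any element named h1 through h6.
$list = $xpath->query('//*[self::h1 or self::h2 or self::h3 or self::h4 or self::h5 or self::h6]');

$headings = array();
foreach ($list as $node) {
    $headings[] = trim($node->textContent);
}

$csv = implode(',', $headings);
print($csv . "\n");
```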
You're probably better off using an HTML parser. But for really simple scenarios, something like this might do:
if (preg_match_all('/<h[1-6]>([^<]*)<\/h[1-6]>/iU', $str, $matches)) {
    // $matches[0] holds the full h1-h6 elements, $matches[1] the text inside them
}
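The pattern above only matches headings without attributes or nested tags. A slightly more tolerant variant, still only suitable for simple markup, might look like the sketch below. The sample $str and the names $headings and $csv are assumptions for illustration; the backreference \1 makes the closing tag use the same level as the opening one.

```php
<?php
// Sample input (made up for this example).
$str = '<h1>My <em>Title</em></h1><p>Body</p><h2 class="sub">Subtitle</h2>';

// <h1>..<h6> with optional attributes; (.*?) lazily captures the contents
// up to the closing tag of the same level (\1); "s" lets . match newlines.
preg_match_all('/<h([1-6])\b[^>]*>(.*?)<\/h\1\s*>/is', $str, $matches);

// $matches[2] holds the inner HTML of each heading; drop nested tags and trim.
$headings = array_map(function ($s) {
    return trim(strip_tags($s));
}, $matches[2]);

// Comma-delimited result, as asked for in the question.
$csv = implode(',', $headings);
print($csv . "\n");
```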
I know this is a super old post, but I wanted to mention the best way I found to grab heading tags collectively, for example:
<h1>title</h1> and <h2>title 2</h2>
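The following pattern works as a plain regex, but PHP needs a small change because the unescaped / in the closing tag clashes with the / delimiter:
/<\s*h[1-2](?:.*)>(.*)</\s*h/i
In PHP, use | as the delimiter in your preg_match() call and add the U (ungreedy) modifier:
|<\s*h[1-2](?:.*)>(.*)</\s*h|Ui
Here $group is the matches array passed to preg_match(). $group[1] contains whatever is between the heading tags, and $group[0] is the entire match, e.g. <h1>test</h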
This accounts for extra spaces, and if someone adds a class or id attribute, as in
<h1 class="classname">test</h1>
the attribute is consumed by the non-capturing group and ignored.
NOTE: When I analyze HTML tags, I always collapse all whitespace (line breaks, tabs, runs of spaces, etc.) into a single space first. This avoids having to deal with multi-line or dotall modifiers, and with very large amounts of whitespace, which in some cases can interfere with the regex.
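A minimal sketch of the whitespace step followed by the pattern above, assuming a made-up sample $html. The names $flat and $texts are illustrative; $group is the matches array as described earlier. preg_match_all() is used here so that every heading is returned, not just the first.

```php
<?php
// Sample input with line breaks, tabs and an attribute (made up for this example).
$html = "<h1 class=\"classname\">\n   test\n</h1>\n<h2>title\t2</h2>";

// Collapse every run of whitespace (spaces, tabs, line breaks) into one space.
$flat = preg_replace('/\s+/', ' ', $html);

// Same pattern as above: | delimiter, ungreedy (U), case-insensitive (i).
preg_match_all('|<\s*h[1-2](?:.*)>(.*)</\s*h|Ui', $flat, $group);

// $group[1] holds the text between the heading tags; trim the leftover spaces.
$texts = array_map('trim', $group[1]);

print(implode(',', $texts) . "\n");
```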