From a string that contains a lot of HTML, how can I extract all the text from <h1>, <h2>, etc. tags into a new variable?
I would like to capture all of the text from these elements and store them in a new variable as comma-delimited values.
Is it possible using preg_match_all()?
First you need to clean up the HTML ($html_str in the example) with tidy:
$tidy_config = array(
    "indent"               => true,
    "output-xml"           => true,
    "output-xhtml"         => false,
    "drop-empty-paras"     => false,
    "hide-comments"        => true,
    "numeric-entities"     => true,
    "doctype"              => "omit",
    "char-encoding"        => "utf8",
    "repeated-attributes"  => "keep-last"
);
$xml_str = tidy_repair_string($html_str, $tidy_config);
Then you can load the XML ($xml_str) into a DOMDocument:
// loadXML() is an instance method; calling it statically fails in PHP 8.
$doc = new DOMDocument();
$doc->loadXML($xml_str);
And finally you can use Horia Dragomir's method:
$list = $doc->getElementsByTagName("h1");
for ($i = 0; $i < $list->length; $i++) {
    print($list->item($i)->nodeValue . "<br/>\n");
}
Or you could use XPath for more complex queries on the DOMDocument (see http://www.php.net/manual/en/class.domxpath.php):
$xpath = new DOMXPath($doc);
$list = $xpath->evaluate("//h1");
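To get from the node list to the comma-delimited variable the question asks for, something like the following sketch might work. It is only an illustration under a few assumptions: headings_to_csv() is a hypothetical helper name, the sample markup is made up, and it uses DOMDocument::loadHTML() (which tolerates sloppy HTML) instead of the tidy step above, so the tidy extension is not needed.

```php
<?php
// Hypothetical helper: returns the text of every h1-h6 element,
// in document order, as one comma-delimited string.
function headings_to_csv(string $html): string
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid; collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // The union operator selects all six heading levels in one query.
    $nodes = $xpath->query('//h1 | //h2 | //h3 | //h4 | //h5 | //h6');

    $texts = [];
    foreach ($nodes as $node) {
        // textContent also includes the text of nested tags such as <em>.
        $texts[] = trim($node->textContent);
    }

    return implode(', ', $texts);
}

// Made-up sample input for illustration.
$html = '<html><body><h1>Title</h1><p>Intro</p>'
      . '<h2 class="sub">Section <em>one</em></h2></body></html>';

$headings = headings_to_csv($html);
echo $headings, "\n"; // Title, Section one
```

Note that loadHTML() assumes ISO-8859-1 unless the document declares a charset, so UTF-8 input without a meta charset tag may need extra handling.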
You're probably better off using an HTML parser. But for really simple scenarios, something like this might do:
if (preg_match_all('/<h[1-6]>([^<]*)<\/h[1-6]>/iU', $str, $matches)) {
    // $matches[0] holds the full h1-h6 elements, $matches[1] their text.
    // Headings with attributes or nested tags are not matched.
}
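If the headings may carry attributes or inline markup, a slightly more tolerant pattern can still produce the comma-delimited string from the question. This is a sketch, not a robust parser: the sample string and variable names are made up, and it will still break on unusual markup (for example a > inside an attribute value).

```php
<?php
// Made-up sample input for illustration.
$str = '<h1>First</h1><p>text</p><h2 class="sub">Second</h2><h3>Third</h3>';

// <h([1-6])  captures the heading level
// \b[^>]*    skips any attributes up to the end of the opening tag
// (.*?)      lazily captures the heading content
// <\/h\1\s*> requires a closing tag of the same level
// i = case-insensitive, s = let . match newlines
$pattern = '/<h([1-6])\b[^>]*>(.*?)<\/h\1\s*>/is';

$csv = '';
if (preg_match_all($pattern, $str, $matches)) {
    // $matches[2] holds the inner content of each heading;
    // strip_tags() removes nested inline markup such as <em>.
    $texts = array_map(function ($s) {
        return trim(strip_tags($s));
    }, $matches[2]);
    $csv = implode(', ', $texts);
}

echo $csv, "\n"; // First, Second, Third
```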
I know this is a super old post, but I wanted to mention the best way I have found to grab heading tags collectively. Take this input:
<h1>title</h1> and <h2>title 2</h2>
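This pattern works in a generic regex tester (PHP needs it written a bit differently):
/<\s*h[1-2](?:.*)>(.*)</\s*h/i
In PHP, use | as the delimiter, because the pattern contains an unescaped /, and add the U (ungreedy) modifier when you pass it to preg_match() or preg_match_all():
|<\s*h[1-2](?:.*)>(.*)</\s*h|Ui
$group[1] contains whatever is between the heading tags.
$group[0] is the whole match, e.g. <h1>test</h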
This accounts for extra spaces, and for the case where someone adds attributes such as class or id:
<h1 class="classname">test</h1>
The attribute part falls into the non-capturing group, so it is ignored.
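As a quick check of that pattern, here is a small sketch that runs it with preg_match_all(). The input string is made up for illustration, and $group is simply the variable that receives the matches.

```php
<?php
// Made-up sample combining a heading with a class attribute and a plain one.
$html = '<h1 class="classname">test</h1> and <h2>title 2</h2>';

// | is the delimiter, so the / in the closing tag needs no escaping.
// U makes every quantifier ungreedy, i makes the match case-insensitive.
preg_match_all('|<\s*h[1-2](?:.*)>(.*)</\s*h|Ui', $html, $group);

// $group[1]: the text between the heading tags.
echo implode(', ', $group[1]), "\n"; // test, title 2

// $group[0]: each full match, which stops right after "</h".
echo $group[0][0], "\n"; // <h1 class="classname">test</h
```

Because the character class is [1-2], only h1 and h2 are matched; widening it to [1-6] would cover the remaining heading levels.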
NOTE: When I analyze HTML tags, I always collapse all whitespace (line breaks, tabs, and so on) into a single space first. This avoids having to deal with multi-line and dotall matching, and with very large runs of whitespace, which in some cases can throw a regex off.
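A minimal sketch of that normalization step, assuming a made-up input string; \s+ matches any run of spaces, tabs and line breaks:

```php
<?php
// Made-up heading spread over several lines with mixed whitespace.
$html = "<h1>\n    Multi-line\t heading\r\n</h1>";

// Replace every run of whitespace with a single space.
$normalized = preg_replace('/\s+/', ' ', $html);

echo $normalized, "\n"; // <h1> Multi-line heading </h1>
```

Keep in mind that this also collapses whitespace inside elements such as <pre>, where it is significant, so it is only safe when the text is being extracted rather than re-rendered.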