
How to avoid DOM parsing adding html doctype, <head> and <body> tags?

Tags: dom, php, parsing

<?php
    $string = '
    Some photos<br>
    <span class="naslov_slike">photo_by_ile_IMG_1676-01</span><br />
    <span class="naslov_slike">photo_by_ile_IMG_1699-01</span><br />
    <span class="naslov_slike">photo_by_ile_IMG_1697-01</span><br />
    <span class="naslov_slike">photo_by_ile_IMG_1695-01</span><br />
    ';

    $dom = new DOMDocument();
    // Must be set before loading; it has no effect once the document is parsed.
    $dom->preserveWhiteSpace = false;
    $dom->loadHTML($string);
    $elements = $dom->getElementsByTagName('span');
    // Copy the live node list first, so removing nodes does not skip any.
    $spans = array();
    foreach ($elements as $span) {
        $spans[] = $span;
    }
    foreach ($spans as $span) {
        $span->parentNode->removeChild($span);
    }
    echo $dom->saveHTML();
?>

I'm using this code to parse HTML strings. The string it returns has some added tags:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>Some photos<br><br><br><br><br></p></body></html>

Is there any way to avoid this and get a clean string back? The input string above is just an example; in real use it can be any HTML string.

ilija veselica asked Oct 06 '09 21:10





1 Answer

Hey, why not answer a 9-year-old question? PHP 5.4 (released in March 2012, after this question was asked) added the options parameter to DOMDocument::loadHTML(). With it you can do this:

$dom = new DOMDocument();
$dom->loadHTML($string, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
// do stuff
echo $dom->saveHTML();

We pass two constants: LIBXML_HTML_NODEFDTD says not to add a document type definition, and LIBXML_HTML_NOIMPLIED says not to add implied elements like <html> and <body>.
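
One thing to watch for: with LIBXML_HTML_NOIMPLIED, a fragment that has no single root element (like the one in the question) can come back restructured, because the parser still wants one top-level element. A common workaround is to wrap the fragment in a temporary element and serialize only its children. The following is a minimal sketch of that idea, not a definitive implementation. The function name removeSpans and the wrapper <div> are made up for illustration, and the type declarations assume PHP 7 or later.

```php
<?php
// Hypothetical helper (not from the original post): removes every <span>
// from an HTML fragment and returns the fragment without the added
// doctype, <html> or <body>.
function removeSpans(string $html): string
{
    $dom = new DOMDocument();

    // Collect parser warnings internally instead of emitting them,
    // then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    // The temporary <div> gives the parser a single root element.
    $dom->loadHTML(
        '<div>' . $html . '</div>',
        LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $wrapper = $dom->documentElement;

    // Copy the live node list to an array before removing anything,
    // otherwise nodes get skipped while the list shrinks.
    foreach (iterator_to_array($dom->getElementsByTagName('span')) as $span) {
        $span->parentNode->removeChild($span);
    }

    // Serialize only the wrapper's children, so the <div> itself is dropped.
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$result = removeSpans(
    'Some photos<br><span class="naslov_slike">photo_by_ile_IMG_1676-01</span><br />'
);
echo $result;
```

Passing a node to saveHTML() serializes just that node, which is what lets the wrapper be discarded at the end. If the input contains non-ASCII characters, the encoding may need separate handling, which this sketch does not attempt.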

miken32 answered Oct 30 '22 11:10
