Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to parse HTML in PHP?

I know we can use PHP DOM to parse HTML using PHP. I found a lot of questions here on Stack Overflow too. But I have a specific requirement. I have an HTML content like below

<p class="Heading1-P">
    <span class="Heading1-H">Chapter 1</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 1</span>
</p>
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 2</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 2</span>
</p>
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 3</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 3</span>
</p>

I want to parse the above HTML and save the content into two different arrays like:

$heading and $content

$heading = array('Chapter 1','Chapter 2','Chapter 3');
$content = array('This is chapter 1','This is chapter 2','This is chapter 3');

I can achieve this simply using jQuery. But I am not sure, if that's the right way. It would be great if someone can point me to the right direction. Thanks in advance.

like image 578
laradev Avatar asked Aug 21 '13 04:08

laradev


People also ask

Can PHP read HTML file?

Make a PHP file to read HTML content from a text filetxt' file in read mode and then use fread() function to display file content. You may also like read and delete file from folder using PHP. That's all, this is how to read HTML content from text file using PHP.

Can you parse HTML?

HTML parsing involves tokenization and tree construction. HTML tokens include start and end tags, as well as attribute names and values. If the document is well-formed, parsing it is straightforward and faster. The parser parses tokenized input into the document, building up the document tree.

What is parsing in PHP?

Definition and Usage. The parse_str() function parses a query string into variables. Note: If the array parameter is not set, variables set by this function will overwrite existing variables of the same name. Note: The magic_quotes_gpc setting in the php.

What is the best HTML parser?

The best performers are Golang and C with very similar results. Python LIBXML2 performs fairly well. Ruby speed is similar to Python.


2 Answers

I have used domdocument and domxpath to get the solution, you can find it at:

<?php
$dom = new DomDocument();
$test='<p class="Heading1-P">
    <span class="Heading1-H">Chapter 1</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 1</span>
</p>
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 2</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 2</span>
</p>
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 3</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 3</span>
</p>';

$dom->loadHTML($test);
$xpath = new DOMXpath($dom);
    $heading=parseToArray($xpath,'Heading1-H');
    $content=parseToArray($xpath,'Normal-H');

var_dump($heading);
echo "<br/>";
var_dump($content);
echo "<br/>";

function parseToArray($xpath,$class)
{
    $xpathquery="//span[@class='".$class."']";
    $elements = $xpath->query($xpathquery);

    if (!is_null($elements)) {  
        $resultarray=array();
        foreach ($elements as $element) {
            $nodes = $element->childNodes;
            foreach ($nodes as $node) {
              $resultarray[] = $node->nodeValue;
            }
        }
        return $resultarray;
    }
}

Live result: http://saji89.codepad.org/2TyOAibZ

like image 108
saji89 Avatar answered Sep 19 '22 02:09

saji89


Try to look at PHP Simple HTML DOM Parser

It has brilliant syntax similar to jQuery so you can easily select any element you want by ID or class

// include/require the simple html dom parser file

$html_string = '
    <p class="Heading1-P">
        <span class="Heading1-H">Chapter 1</span>
    </p>
    <p class="Normal-P">
        <span class="Normal-H">This is chapter 1</span>
    </p>
    <p class="Heading1-P">
        <span class="Heading1-H">Chapter 2</span>
    </p>
    <p class="Normal-P">
        <span class="Normal-H">This is chapter 2</span>
    </p>
    <p class="Heading1-P">
        <span class="Heading1-H">Chapter 3</span>
    </p>
    <p class="Normal-P">
        <span class="Normal-H">This is chapter 3</span>
    </p>';
$html = str_get_html($html_string);
foreach($html->find('span') as $element) {
    if ($element->class === 'Heading1-H') {
        $heading[] = $element->innertext;
    }else if($element->class === 'Normal-H') {
        $content[] = $element->innertext;
    }
}
like image 35
Paul Denisevich Avatar answered Sep 21 '22 02:09

Paul Denisevich