How do I use the PHP Simple HTML DOM Parser to parse this?

Tags: html, php, parsing
Here is an example of the HTML I need to parse into a PHP program:

                    <div id="dump-list">    
<div class="dump-row"> 
 <div class="dump-location odd" data-jmapping="{id: 35, point: {lng: -73.00898601, lat: 41.71727402}, category: 'office'}">

    <div class="SingleLinkNoTx">
    <a href="#10" class="loc-link">Acme Software</a><br/><strong>John Doe, MBA</strong><br/>123 Main St.<br />New York, NY 10036<br /><strong class="telephone">(212) 555-1234</strong><br/>
    </div><!-- END.SingleLinkNoTx -->

    <a href="http://www.example.com" target="_blank" class="web_link">Visit Website</a><span><br />(0.3 miles)</span>   
    <div class="loc-info">
            <div class="loc-info-text ">
        John Doe, MBA<br /><a href="http://maps.google.com/?daddr=41.71727402,-73.00898601" target="_blank">Get Directions &raquo;</a>    
        </div>

    </div>

</div>

This is the information I want to extract from the above HTML example into PHP:

lng: -73.00898601, lat: 41.71727402
category: 'office'
Acme Software
John Doe, MBA
123 Main St.
New York, NY 10036
(212) 555-1234
http://www.example.com

I have tried using PHP Simple HTML DOM Parser, but I'm new to it and can't find a working PHP example that covers what I need to do. I tried the PHP code below to understand how it works, but var_dump($e) produces a huge amount of output, including messages about recursion, so I'm lost as to how to actually use it. I'd greatly appreciate some help!

$e = $html->find('.dump-location', 0)->find('.SingleLinkNoTx', 0);
echo $e;
var_dump($e);
asked by Edward, Feb 17 '26 14:02

1 Answer

Use XPath to find and extract elements in an HTML/XML document - specifically the SimpleXMLElement::xpath method.

The following example will find the telephone number for a location:

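$doc = new DOMDocument();
// Collect libxml parse warnings instead of printing them; scraped HTML is rarely well-formed.
libxml_use_internal_errors(true);
$doc->loadHTML('your html snippet goes here - or use loadHTMLFile()');
// Convert the DOM tree to SimpleXML so that the xpath() method is available.
$xml = simplexml_import_dom($doc);
// xpath() returns an array of the matching SimpleXMLElement objects.
$elements = $xml->xpath('//*[contains(@class, "dump-location")]/div[@class="SingleLinkNoTx"]/strong[@class="telephone"]');
print_r($elements);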

The most complex part is the XPath expression. A quick breakdown:

  1. //
  • Selects matching elements at any depth in the document, not only at the root.
  2. *[contains(@class, "dump-location")]
  • Matches any element whose class attribute contains the string dump-location.
  3. /
  • Applies the next rule only to direct children of the elements matched so far (here, children of a dump-location element).
  4. div[@class="SingleLinkNoTx"]
  • Matches any DIV element whose class attribute is exactly SingleLinkNoTx (no other class names).
  5. /strong[@class="telephone"]
  • Among that DIV's direct children, matches every STRONG tag whose class attribute is exactly telephone.
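
To see how each rule narrows the match, here is a minimal sketch that builds the expression up one piece at a time and counts the matches after each step. The HTML string is a trimmed-down copy of the question's markup, and the variable names ($pieces, $counts) are made up for this illustration.

```php
<?php
// Trimmed-down copy of the question's markup (an assumption for this sketch).
$html = '<div class="dump-location odd">'
      . '<div class="SingleLinkNoTx">'
      . '<a href="#10" class="loc-link">Acme Software</a><br/>'
      . '<strong>John Doe, MBA</strong><br/>'
      . '<strong class="telephone">(212) 555-1234</strong>'
      . '</div></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xml = simplexml_import_dom($doc);

// The full expression, split into the pieces described in the breakdown.
$pieces = [
    '//*[contains(@class, "dump-location")]',
    '/div[@class="SingleLinkNoTx"]',
    '/strong',
    '[@class="telephone"]',
];

$expr   = '';
$counts = [];
foreach ($pieces as $piece) {
    $expr .= $piece;                      // extend the expression by one rule
    $counts[] = count($xml->xpath($expr)); // how many elements match so far
    echo end($counts), '  ', $expr, PHP_EOL;
}
```

With this snippet the /strong step matches both STRONG tags, and only the final [@class="telephone"] predicate narrows the result to the phone number.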

Running this XPath expression against the HTML snippet from the question produces output like the following, which is fairly easy to iterate over and extract information from:

Array
(
    [0] => SimpleXMLElement Object
        (
            [@attributes] => Array
                (
                    [class] => telephone
                )

            [0] => (212) 555-1234
        )

)
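
As a rough sketch of that iteration: casting a SimpleXMLElement to a string gives its text content, and array-style access gives its attributes. The HTML here is a shortened stand-in for the question's markup, and $phones is just an illustrative name.

```php
<?php
// Shortened stand-in for the question's markup (an assumption for this sketch).
$html = '<div class="dump-location odd"><div class="SingleLinkNoTx">'
      . '<strong>John Doe, MBA</strong><br/>'
      . '<strong class="telephone">(212) 555-1234</strong>'
      . '</div></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xml = simplexml_import_dom($doc);

$elements = $xml->xpath('//*[contains(@class, "dump-location")]/div[@class="SingleLinkNoTx"]/strong[@class="telephone"]');

$phones = [];
foreach ($elements as $el) {
    // (string) $el is the element's text; $el['class'] is its class attribute.
    $phones[] = [
        'class' => (string) $el['class'],
        'text'  => trim((string) $el),
    ];
}
print_r($phones);
```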

If you know the document structure, it's possible to construct an XPath expression for each piece of information you want to extract. Alternatively, it might be simpler to use a more general XPath expression (say, one that retrieves all dump-location elements) and manually iterate through the elements.
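
One way the second approach could look is sketched below. Note that it swaps SimpleXML for DOMXPath, because the address lines in the question are bare text nodes between <br> tags and those are easier to reach through the DOM API. The array keys, the variable names, and the regular expressions used to pick values out of data-jmapping are assumptions made for this example; the attribute value is not valid JSON, so json_decode() would not work on it directly.

```php
<?php
// Shortened copy of the question's markup (an assumption for this sketch).
$html = <<<'HTML'
<div class="dump-location odd" data-jmapping="{id: 35, point: {lng: -73.00898601, lat: 41.71727402}, category: 'office'}">
  <div class="SingleLinkNoTx">
    <a href="#10" class="loc-link">Acme Software</a><br/><strong>John Doe, MBA</strong><br/>123 Main St.<br />New York, NY 10036<br /><strong class="telephone">(212) 555-1234</strong><br/>
  </div>
  <a href="http://www.example.com" target="_blank" class="web_link">Visit Website</a>
</div>
HTML;

libxml_use_internal_errors(true); // keep parse warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$locations = [];
// General expression: every dump-location element, then iterate manually.
foreach ($xpath->query('//*[contains(@class, "dump-location")]') as $loc) {
    $box = $xpath->query('div[@class="SingleLinkNoTx"]', $loc)->item(0);
    if ($box === null) {
        continue; // skip locations without the expected inner DIV
    }

    // Address lines are the non-blank text nodes directly inside the box.
    $address = [];
    foreach ($xpath->query('text()[normalize-space()]', $box) as $text) {
        $address[] = trim($text->nodeValue);
    }

    // Pull lng/lat and category out of the raw attribute string.
    $raw = $loc->getAttribute('data-jmapping');
    preg_match('/lng:\s*(-?[\d.]+),\s*lat:\s*(-?[\d.]+)/', $raw, $point);
    preg_match("/category:\s*'([^']*)'/", $raw, $cat);

    $locations[] = [
        'lng'      => $point[1] ?? null,
        'lat'      => $point[2] ?? null,
        'category' => $cat[1] ?? null,
        'name'     => $xpath->evaluate('string(a[@class="loc-link"])', $box),
        'contact'  => $xpath->evaluate('string(strong[not(@class)])', $box),
        'address'  => $address,
        'phone'    => $xpath->evaluate('string(strong[@class="telephone"])', $box),
        'website'  => $xpath->evaluate('string(a[@class="web_link"]/@href)', $loc),
    ];
}
print_r($locations);
```

Each entry in $locations then holds the fields listed in the question, and the same loop handles any number of dump-location blocks on the page.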

answered by leepowers, Feb 20 '26 04:02