Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using Xpath with PHP to parse HTML

Tags:

php

xpath

I'm currently trying to parse some data from a forum. Here is the code:

$xml = simplexml_load_file('https://forums.eveonline.com');

$names = $xml->xpath("html/body/div/div/form/div/div/div/div/div[*]/div/div/table//tr/td[@class='topicViews']");
foreach($names as $name) 
{
    echo $name . "<br/>";
}

Anyway, the problem is that I'm using google xpath extension to help me get the path, and I'm guessing that google is changing the html enough to make it not come up when i use my website to do this search. Is there some type of way I can make the host look at the site through google chrome so that it gets the right code? What would you suggest?

Thanks!

like image 415
VixenSoul Avatar asked Dec 05 '12 07:12

VixenSoul


2 Answers

My suggestion is to always use DOMDocument as opposed to SimpleXML, since it's a much nicer interface to work with and makes tasks a lot more intuitive.

The following example shows you how to load the HTML into the DOMDocument object and query the DOM using XPath. All you really need to do is find all td elements with a class name of topicViews and this will output each of the nodeValue members found in the DOMNodeList returned by this XPath query.

/* Use internal libxml errors -- turn on in production, off for debugging */
libxml_use_internal_errors(true);
/* Createa a new DomDocument object */
$dom = new DomDocument;
/* Load the HTML */
$dom->loadHTMLFile("https://forums.eveonline.com");
/* Create a new XPath object */
$xpath = new DomXPath($dom);
/* Query all <td> nodes containing specified class name */
$nodes = $xpath->query("//td[@class='topicViews']");
/* Set HTTP response header to plain text for debugging output */
header("Content-type: text/plain");
/* Traverse the DOMNodeList object to output each DomNode's nodeValue */
foreach ($nodes as $i => $node) {
    echo "Node($i): ", $node->nodeValue, "\n";
}
like image 184
Sherif Avatar answered Oct 23 '22 04:10

Sherif


A double '/' will make xpath search. So if you would use the xpath '//table' you would get all tables. You can also use this deeper in your xpath structure like 'html/body/div/div/form//table' to get all tables under xpath 'html/body/div/div/form'.

This way you can make your code a bit more resilient against changes in the html source.

I do suggest learning a little about xpath if you want to use it. Copy paste only gets you so far.

A simple explanation about the syntax can be found at w3schools.com/xml/xpath_syntax.asp

like image 36
Damien Overeem Avatar answered Oct 23 '22 03:10

Damien Overeem