
How to parse HTML more elegantly in PHP?

Tags:

php

Here is some simple HTML.

<table>

<tr><th>Name</th><th>Price</th><th>Country</th></tr>
<tr><td><a href="bbb/111">Apple</a></td><td>500</td><td>America</td></tr>
<tr><td><a href="bbb/222">Samsung</a></td><td>400</td><td>Korea</td></tr>
<tr><td><a href="bbb/333">Nokia</a></td><td>300</td><td>Finland</td></tr>
<tr><td><a href="bbb/444">HTC</a></td><td>200</td><td>Taiwan</td></tr>
<tr><td><a href="bbb/555">Blackberry</a></td><td>100</td><td>America</td></tr>

</table>

I want to scrape each company name and its price, like this:

Apple 500 / Samsung 400 / Nokia 300 / HTC 200 / Blackberry 100 

So I am using PHP's DOM parser. I know there are many third-party PHP parser libraries, but people say it is better to use the built-in one, so I wrote this:

$source_n = file_get_contents($html);
$dom = new DOMDocument();
@$dom->loadHTML($source_n);
$stacks =  $dom->getElementsByTagName('table')->item(0)->textContent;
echo $stacks; 

It outputs all the text as one long string, like this:

Name Price Country Apple 500 America Samsung 400 Korea ......

I don't think this is very useful. If I continue this way, I will have to split the string with explode(), and the code will get even messier.

How can I scrape this more elegantly? Is there an easy reference?

ton1 asked Jun 28 '15 12:06


2 Answers

Use DOMXPath::query. Gather all the names first:

$selector = new DOMXPath($dom);

$results = $selector->query('//td/a');

foreach($results as $node) {
    echo $node->nodeValue . PHP_EOL;
}

Then get the prices by changing the query to:

$results = $selector->query('//td[2]');
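
If you want each name next to its price rather than two separate lists, one option is to loop over the rows and read both cells relative to each row. This is only a minimal sketch: it assumes the table looks exactly like the one in the question, and the variable names ($html, $xpath, $pairs, $line) are made up for illustration. It also uses libxml_use_internal_errors() in place of the @ operator, which is a swap on my part and not something the code above does.

```php
<?php
// Sample markup copied from the question (assumed to be the whole input).
$html = <<<'HTML'
<table>
<tr><th>Name</th><th>Price</th><th>Country</th></tr>
<tr><td><a href="bbb/111">Apple</a></td><td>500</td><td>America</td></tr>
<tr><td><a href="bbb/222">Samsung</a></td><td>400</td><td>Korea</td></tr>
<tr><td><a href="bbb/333">Nokia</a></td><td>300</td><td>Finland</td></tr>
<tr><td><a href="bbb/444">HTC</a></td><td>200</td><td>Taiwan</td></tr>
<tr><td><a href="bbb/555">Blackberry</a></td><td>100</td><td>America</td></tr>
</table>
HTML;

// Collect parser warnings internally instead of silencing them with @.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

$pairs = [];
// tr[td] matches only rows that have <td> cells, so the <th> header row is skipped.
foreach ($xpath->query('//table//tr[td]') as $row) {
    // string(...) yields the text content of the first matching cell,
    // evaluated relative to the current row.
    $name  = trim($xpath->evaluate('string(td[1])', $row));
    $price = trim($xpath->evaluate('string(td[2])', $row));
    $pairs[] = $name . ' ' . $price;
}

$line = implode(' / ', $pairs);
echo $line, PHP_EOL;
```

Passing $row as the second argument to evaluate() makes td[1] and td[2] relative to that row, so a name and its price always come from the same line of the table.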


viral answered Oct 03 '22 19:10


The best solution I have found for parsing HTML is Symfony's DomCrawler component. Together with the CssSelector component, it lets you filter HTML with CSS selectors, much as you would select elements in JavaScript. For example, to get all p elements that are direct children of body:

$crawler = $crawler->filter('body > p');

baao answered Oct 03 '22 20:10