
Select nodeValue but exclude child elements

Let's say I have this code:

<p dataname="description">
Hello this is a description. <a href="#">Click here for more.</a>
</p>

How do I select the nodeValue of p but exclude a and its content?

My current code:

$result = $xpath->query("//p[@dataname='description'][not(self::a)]");

I select it by $result->item(0)->nodeValue;

Jürgen Paul asked Feb 08 '12 11:02


2 Answers

Simply appending /text() to your query should do the trick:

$result = $xpath->query("//p[@dataname='description'][not(self::a)]/text()");
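For context, here is a minimal, self-contained sketch of the /text() approach. The DOMDocument setup and the variable names ($html, $doc, $description) are assumptions, not part of the original question. The [not(self::a)] predicate is left out because a p element is never an a, so it filters nothing. Note that /text() can return several text nodes (one before the link, one after), so the sketch joins them instead of reading only item(0).

```php
<?php
// Sample markup from the question; the setup around it is assumed.
$html = '<p dataname="description">
Hello this is a description. <a href="#">Click here for more.</a>
</p>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// text() selects only the direct text-node children of <p>,
// so the text inside <a> is not part of the result.
$nodes = $xpath->query("//p[@dataname='description']/text()");

// Join every text node, then trim the surrounding whitespace.
$description = '';
foreach ($nodes as $node) {
    $description .= $node->nodeValue;
}
$description = trim($description);

echo $description, "\n";
```

If only the text before the link is needed, $nodes->item(0)->nodeValue is enough.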
Kristofer answered Oct 21 '22 17:10


Unsure if PHP's XPath supports this, but this XPath does the trick for me in Scrapy (a Python-based scraping framework):

$result = $xpath->query("//p[@dataname='description']/text()[following-sibling::a]");

If this doesn't work, try Kristofer's solution, or you could also use a regex solution. For example:

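$output = preg_replace("~<.*?>.*?<.*?>~msi", '', $innerHtml);

That removes an opening tag, its content, and the next tag that follows, so it works for simple, non-nested elements like the link here and keeps text that is not wrapped in tags. $innerHtml is a placeholder for the serialized markup inside the p element; running the regex on $result->item(0)->nodeValue would change nothing, because nodeValue has already dropped the tags while keeping the link text.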
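A rough sketch of how the regex variant could be wired up, assuming the same sample markup as in the question. Since nodeValue contains no tags, the sketch first serializes the children of the p element with DOMDocument::saveHTML(); the names $p, $innerHtml and $output are illustrative only.

```php
<?php
// Sample markup from the question; the setup around it is assumed.
$html = '<p dataname="description">
Hello this is a description. <a href="#">Click here for more.</a>
</p>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$p = $xpath->query("//p[@dataname='description']")->item(0);

// Build the inner HTML of <p> by serializing each child node.
$innerHtml = '';
foreach ($p->childNodes as $child) {
    $innerHtml .= $doc->saveHTML($child);
}

// Strip "<tag>content</tag>" runs, then trim leftover whitespace.
$output = trim(preg_replace("~<.*?>.*?<.*?>~msi", '', $innerHtml));

echo $output, "\n";
```

This is fragile with nested elements or void tags such as br, so the XPath text() queries are probably the safer choice when they work.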

Sjaak Trekhaak answered Oct 21 '22 18:10