
Preserve line breaks inside <p> tags using DOMXPath?

Tags: html, dom, php, xpath

I'm currently using PHP and DOMXPath to get the contents of all of the <p> elements of a web page:

<?php
...    
$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
$paragraphs = $xpath->evaluate("/html/body//p");

foreach ($paragraphs as $paragraph){
    echo $paragraph->textContent . "<br />";
}

My problem is that the string resulting from textContent does not respect <br /> tags that exist within those <p> elements. Instead it removes the line break and pushes words together that would normally be on separate lines. For example:

Sample HTML:

<p>
Some happy talk goes here talking about our great product.<br />
We would love for you to buy it!
</p>

<p>
Random information and what not<br />
Isn't that cool?
</p>

Current Output from PHP above:

Some happy talk goes here talking about our great product.We would love for you to buy it!

Random information and what notIsn't that cool?

I have tried $paragraphs = $doc->getElementsByTagName("p"); as well and it gives me the same thing.

Is there a way to make DOMXPath/DOMDocument preserve the line breaks? I need to be able to separate each of the words within a paragraph, and the current output disallows that.

If there is an alternative method for retrieving the string within <p> elements while preserving <br /> or '\n' that would also be great.

EDIT

Upon further investigation, the HTML in question is actually a list of anchors separated by <br> tags, with no actual newline characters between them:

<p class="home_page_list"><a href="/home/personal-banking/checking/Category-Page-Classic-Checking/classic-checking.html">Classic Checking</a><br> <a href="/home/personal-banking/checking/Category-Page-Interest-Checking/interest-checking.html">Interest Checking</a><br> <a href="/home/personal-banking/checking/Category-Page-Interest-Checking/interest-premium-checking.html">Premium Checking</a><br> <a href="/home/personal-banking/Savings-Category-Page/Basic-Savings-Category-Page/basic-savings.html">Savings Plans</a><br> <a href="/home/personal-banking/Savings-Category-Page/Money-Market-Accounts-Category-Page/money-market-accounts.html">Money Market Accounts</a><br> <a href="/home/personal-banking/Savings-Category-Page/Certificates-of-Deposit-Category-Page/fixed-rate-CD.html">CDs</a><br> <a href="/home/personal-banking/Savings-Category-Page/Individual-Retirement-Account-Category-Page/individual-retirement-account.html">IRAs</a></p>

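It turns out that textContent works properly with the sample HTML given originally, because that sample has a real newline after each <br />. The problem only shows up with markup like the list above, where the <br> tags are the only separators.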
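To make the difference concrete, here is a minimal sketch comparing the two kinds of markup. The helper name firstParagraphText and the shortened HTML strings are made up for illustration; they are not from the real page.

```php
<?php
// Hypothetical helper for illustration: returns the textContent of the first <p>.
function firstParagraphText(string $html): string
{
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    return $doc->getElementsByTagName('p')->item(0)->textContent;
}

// Markup with a real newline after the <br />, like the first sample.
$withNewlines = "<p>Random information and what not<br />\nIsn't that cool?</p>";

// Simplified markup with <br> as the only separator, like the real page.
$withoutNewlines = '<p><a href="#">Classic Checking</a><br><a href="#">Interest Checking</a></p>';

// The newline text survives in textContent; the <br> element contributes nothing.
var_dump(firstParagraphText($withNewlines));
var_dump(firstParagraphText($withoutNewlines));
```

In the first case the newline character is part of the text node, so it is still there to split on. In the second case the only separator was the <br> element itself, which textContent drops.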

UPDATE: Solved


With the help of @ircmaxell's answer and the comments left by @netcoder and @Gordon, this has been solved. It's not very elegant, but it will do for now.

Example:

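foreach ($paragraphs as $paragraph){
    // DOMinnerHTML() is a user-defined helper, not a PHP built-in; it returns the element's inner markup.
    $p_text = new DOMDocument();
    // Swap each <br> for "\r\n" in the markup, then re-parse it so textContent keeps the breaks.
    $p_text->loadHTML(str_ireplace(array("<br>", "<br />"), "\r\n", DOMinnerHTML($paragraph)));
    // Do whatever; in this case, get all of the words in an array.
    $words = explode(" ", str_ireplace(array(",", ".", "&", ":", "-", "\r\n"), " ", $p_text->textContent));
    print_r($words);
}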

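This makes use of a DOMinnerHTML helper (as suggested by @netcoder) to get each paragraph's markup and replace the instances of <br> with "\r\n" (as suggested by @ircmaxell). The result is loaded into a new DOMDocument, so the line breaks are still present when textContent is read.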
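DOMinnerHTML is not defined anywhere in this post, so for readers who want to run the example, here is one common way such a helper can be written. This is an assumed implementation, not necessarily the version used above, and the sample markup is shortened for illustration.

```php
<?php
// Assumed implementation: the post never shows DOMinnerHTML, so this is
// one common way to write it, not necessarily the version the author used.
function DOMinnerHTML(DOMNode $element): string
{
    $innerHTML = '';
    foreach ($element->childNodes as $child) {
        // saveHTML($node) serializes just that node and its descendants.
        $innerHTML .= $element->ownerDocument->saveHTML($child);
    }
    return $innerHTML;
}

$doc = new DOMDocument();
$doc->loadHTML('<p><a href="/a.html">Classic Checking</a><br><a href="/b.html">Savings Plans</a></p>');
$paragraph = $doc->getElementsByTagName('p')->item(0);

// Prints the markup inside the <p>, including the <br> element.
echo DOMinnerHTML($paragraph);
```

Because the <br> elements come back as markup, a plain string replacement can turn them into "\r\n" before the text is extracted.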

Obviously there's some room for improvement, but it has solved my current issue.

Thanks for the help everyone,

Ben

Asked by Ben L. on Jan 19 '11 at 19:01



1 Answer

Well, what I would do is replace the <br> elements with literal line breaks before reading the text:

$doc = new DOMDocument();
$doc->loadHTML($html);

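// getElementsByTagName() returns a live list, so take a snapshot first;
// replacing nodes while iterating the live list can skip elements.
$brs = iterator_to_array($doc->getElementsByTagName('br'));
foreach ($brs as $node) {
    $node->parentNode->replaceChild($doc->createTextNode("\r\n"), $node);
}
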
$xpath = new DOMXPath($doc);
$paragraphs = $xpath->evaluate("/html/body//p");

foreach ($paragraphs as $paragraph){
    echo $paragraph->textContent . "<br />";
}
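
As a rough end-to-end sketch of the same idea, the function below replaces each <br> with "\r\n" and then splits every paragraph's text into lines. The function name paragraphLines and the shortened sample markup are hypothetical, chosen only for this illustration.

```php
<?php
// Hypothetical helper name; it wraps the <br>-replacement approach shown above.
function paragraphLines(string $html): array
{
    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // Snapshot the live node list before modifying the tree.
    foreach (iterator_to_array($doc->getElementsByTagName('br')) as $br) {
        $br->parentNode->replaceChild($doc->createTextNode("\r\n"), $br);
    }

    $xpath = new DOMXPath($doc);
    $result = [];
    foreach ($xpath->evaluate('/html/body//p') as $paragraph) {
        // \R matches any line-break sequence; empty lines are dropped.
        $lines = preg_split('/\R/', $paragraph->textContent, -1, PREG_SPLIT_NO_EMPTY);
        $result[] = array_map('trim', $lines);
    }
    return $result;
}

$html = '<p><a href="#">Classic Checking</a><br> <a href="#">Interest Checking</a><br> <a href="#">CDs</a></p>';
print_r(paragraphLines($html));
```

Each paragraph comes back as an array of trimmed lines, which can then be split further into words with explode() or preg_split() as needed.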
Answered by ircmaxell on Oct 23 '22 at 16:10