
Remove whitespace and line breaks from captured data with PHP DOMDocument

I'm trying to capture home_impact and away_impact, but when I extract the text it is full of blank lines, whitespace and line breaks, like this:

  David Luiz 
        35'






        36'

            De Gea

I've also tried extracting just the div with id match_info, but that produces an array with a single element, which also contains a lot of line breaks. I've tried using preserveWhiteSpace and preg_replace, but neither worked. Any idea how to avoid that? Thanks.

Html:

   <div id="match_info">
                           <div class="direct_line">
            <div class="home_impact"><div class='player_name'>David Luiz </div></div>
                <div class="minute">35'</div>
                <div class="away_impact">
                </div>
        </div> 
               <div class="direct_line">
            <div class="home_impact"></div>
                <div class="minute">36'</div>
                <div class="away_impact">
                    <div class='player_name'>De Gea</div>
                </div>
        </div> 
                <div class="direct_line">
            <div class="home_impact"></div>
                <div class="minute">38'</div>
                <div class="away_impact">
                    <div class='player_name'>Ashley Cole</div>
                </div>
               <div class="home_impact"><div class='player_name'>Juan Mata</div></div>
                <div class="minute">35'</div>
                <div class="away_impact">
                </div>
        </div> 

PHP:

$html = file_get_contents($url);
$doc = new DOMDocument();
//$doc->preserveWhiteSpace = FALSE;
@$doc->loadHTML($html);
$xpath = new DOMXpath ($doc);
$expresionHome="//div[@class='home_impact']";
$expresionAway="//div[@class='away_impact']";
$nodesHome = $xpath->evaluate($expresionHome);
$nodesAway = $xpath->evaluate($expresionAway);
for ($i=0;$i<$nodesHome->length;$i++)
{
echo $nodesHome->item($i)->nodeValue;
echo $nodesAway->item($i)->nodeValue;
}

Marx asked Aug 25 '14 13:08



2 Answers

You can do this with DOMDocument alone, without trimming node content or using regular expressions. Consider the following example, and note the DOMDocument properties preserveWhiteSpace and formatOutput (the latter if you want pretty-printed output):

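// Load the document with its original, unformatted whitespace.
$unformatteddocument = new DOMDocument("1.0", "utf-8");
$unformatteddocument->load(PATH_OF_UNFORMATTED_XML);

// Re-parse it with whitespace-only text nodes dropped.
// preserveWhiteSpace must be set before loadXML() is called;
// formatOutput makes save() pretty-print the result.
$document = new DOMDocument("1.0", "utf-8");
$document->preserveWhiteSpace = false;
$document->formatOutput = true;
$document->loadXML($unformatteddocument->saveXML());
$document->save(PATH_FOR_FORMATTED_XML);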
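
To see what preserveWhiteSpace actually changes, here is a minimal sketch that parses the same XML string twice and compares the results. The XML sample and the variable names are made up for illustration. As far as I can tell, the property only drops text nodes that consist entirely of whitespace; it does not trim whitespace inside a node that also contains other text.

```php
<?php
// Hypothetical sample: whitespace-only text nodes between the elements,
// plus padded text inside <b>.
$xml = "<root>\n  <a>x</a>\n  <b> y </b>\n</root>";

$keep = new DOMDocument();
$keep->preserveWhiteSpace = true;   // the default behaviour
$keep->loadXML($xml);

$strip = new DOMDocument();
$strip->preserveWhiteSpace = false; // has to be set before loading
$strip->loadXML($xml);

// Count the direct children of <root> in both documents.
$keepCount  = $keep->documentElement->childNodes->length;
$stripCount = $strip->documentElement->childNodes->length;

// Text inside <b> still carries its surrounding spaces.
$bText = $strip->getElementsByTagName('b')->item(0)->nodeValue;

echo "children with whitespace kept: $keepCount\n";
echo "children with whitespace dropped: $stripCount\n";
echo "text of <b>: '$bText'\n";
```

With whitespace kept, the root should report five children (three whitespace text nodes plus the two elements); with it dropped, only the two elements remain, while the text of b is still ' y '. That is why this approach alone may not clean up values such as "David Luiz " from the question.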

Aleksandr Shumilov answered Oct 05 '22 23:10



Normalizing whitespace in PHP with UTF-8 encoding, which is how DOMDocument in PHP returns strings:

$normalized = preg_replace(['(\s+)u', '(^\s|\s$)u'], [' ', ''], $text);

This first reduces each run of whitespace to a single space, then trims the space at the beginning or end of the string.

Compare with 2.10 White Space Handling from the XML standard.
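
Applied to the markup from the question, that could look roughly like the sketch below. The inline HTML is a shortened copy of the asker's sample standing in for the fetched page, and normalizeWhitespace() and $results are hypothetical names introduced here, not part of any library.

```php
<?php
// Shortened stand-in for the page the question fetches with file_get_contents().
$html = <<<HTML
<div id="match_info">
    <div class="direct_line">
        <div class="home_impact"><div class='player_name'>David Luiz </div></div>
        <div class="minute">35'</div>
        <div class="away_impact">
        </div>
    </div>
    <div class="direct_line">
        <div class="home_impact"></div>
        <div class="minute">36'</div>
        <div class="away_impact">
            <div class='player_name'>De Gea</div>
        </div>
    </div>
</div>
HTML;

// Hypothetical helper wrapping the preg_replace call shown above.
function normalizeWhitespace(string $text): string
{
    // preg_replace returns null on error (e.g. invalid UTF-8 with the
    // u modifier); fall back to the unchanged input in that case.
    return preg_replace(['(\s+)u', '(^\s|\s$)u'], [' ', ''], $text) ?? $text;
}

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of using @
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$results = [];
foreach (['home_impact', 'away_impact'] as $class) {
    foreach ($xpath->query("//div[@class='$class']") as $node) {
        // Collapse and trim the whitespace of each captured value.
        $results[$class][] = normalizeWhitespace($node->nodeValue);
    }
}

print_r($results);
```

Empty cells come back as empty strings rather than runs of line breaks, so the home and away values stay aligned by index and can be skipped with a simple `!== ''` check when printing.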

hakre answered Oct 05 '22 23:10
