Remove whitespace and line breaks from captured data with PHP DOMDocument

Question

I'm trying to capture home_impact and away_impact, but when I extract the text it is full of blank lines, whitespace, and line breaks, like this:

  David Luiz 
        35'






        36'

            De Gea

I've also tried extracting just the div with id match_info, but that produces an array with a single element, and it also contains a lot of line breaks. I've tried using preserveWhiteSpace and preg_replace, but neither worked. Any idea how to avoid that? Thanks.

Html:

   <div id="match_info">
                           <div class="direct_line">
            <div class="home_impact"><div class='player_name'>David Luiz </div></div>
                <div class="minute">35'</div>
                <div class="away_impact">
                </div>
        </div> 
               <div class="direct_line">
            <div class="home_impact"></div>
                <div class="minute">36'</div>
                <div class="away_impact">
                    <div class='player_name'>De Gea</div>
                </div>
        </div> 
                <div class="direct_line">
            <div class="home_impact"></div>
                <div class="minute">38'</div>
                <div class="away_impact">
                    <div class='player_name'>Ashley Cole</div>
                </div>
               <div class="home_impact"><div class='player_name'>Juan Mata</div></div>
                <div class="minute">35'</div>
                <div class="away_impact">
                </div>
        </div>

PHP:

$html = file_get_contents($url);
$doc = new DOMDocument();
//$doc->preserveWhiteSpace = FALSE;
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$expresionHome = "//div[@class='home_impact']";
$expresionAway = "//div[@class='away_impact']";
$nodesHome = $xpath->evaluate($expresionHome);
$nodesAway = $xpath->evaluate($expresionAway);
for ($i = 0; $i < $nodesHome->length; $i++)
{
    echo $nodesHome->item($i)->nodeValue;
    echo $nodesAway->item($i)->nodeValue;
}

Aleksandr Shumilov · Accepted Answer

You can do this with DOMDocument alone, without trimming node content or using regular expressions. Consider the following example, and note the DOMDocument properties preserveWhiteSpace and formatOutput (the latter if you want pretty-printed output):

// DOMDocument with unformatted content
$unformatteddocument = new DOMDocument("1.0", "utf-8");
$unformatteddocument->load(PATH_OF_UNFORMATTED_XML);

$document = new DOMDocument("1.0", "utf-8");
$document->preserveWhiteSpace = false;
$document->formatOutput = true;
$document->loadXML($unformatteddocument->saveXML());
$document->save(PATH_FOR_FORMATTED_XML);
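
As a self-contained illustration of the same two properties, here is a minimal sketch that works on an in-memory string instead of files. The sample XML and its element names are made up for this example. As far as I understand it, with preserveWhiteSpace set to false the parser drops whitespace-only text nodes between elements but leaves whitespace inside a node's own text alone, so an extracted value may still need trimming.

```php
<?php
// Hypothetical sample input; the element names are invented for illustration.
$xml = "<line>\n      <player>David Luiz </player>\n\n</line>";

$document = new DOMDocument("1.0", "utf-8");
$document->preserveWhiteSpace = false; // has to be set before loading
$document->formatOutput = true;        // re-indent when serializing
$document->loadXML($xml);

// Whitespace-only text nodes around <player> are dropped during parsing,
// so <player> is the only child left under the root element.
$childCount = $document->documentElement->childNodes->length;

// Text inside the element is kept as-is, including the trailing space.
$playerText = $document->getElementsByTagName('player')->item(0)->textContent;

// Serialize the re-indented document to a string instead of a file.
$formatted = $document->saveXML();
echo $formatted;
```

In this sketch the trailing space in "David Luiz " survives, which suggests the properties tidy the document layout rather than the text content itself.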

hakre · Answer

Normalizing whitespace in PHP with UTF-8 encoding, which is how DOMDocument in PHP returns strings:

$normalized = preg_replace(['(\s+)u', '(^\s|\s$)u'], [' ', ''], $text);

This first reduces each run of whitespace to a single space, and then trims the space at the beginning and end of the string.
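
To make the two steps concrete, here is a small sketch that applies the same call to a string shaped like the output in the question. The sample value of `$text` is an assumption reconstructed from that output, not data from a real page.

```php
<?php
// Sample text shaped like the extracted output shown in the question.
$text = "\n  David Luiz \n        35'\n\n\n\n        36'\n\n            De Gea\n";

// The first pattern collapses every run of whitespace into one space;
// the second pattern removes the single space left at either end.
$normalized = preg_replace(['(\s+)u', '(^\s|\s$)u'], [' ', ''], $text);

var_dump($normalized); // string(25) "David Luiz 35' 36' De Gea"
```

Because the collapse runs first, at most one space remains at each end, which is why the second pattern only needs to match a single whitespace character.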

Compare with 2.10 White Space Handling from the XML standard.
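
A related option, not proposed in either answer, is to let XPath do the normalization: the XPath 1.0 function `normalize-space()` strips leading and trailing whitespace and collapses inner runs to a single space. The sketch below is only one way to apply it to markup like the question's. The inline `$html` sample, the per-row loop over `direct_line`, and the `$events` array are assumptions made for illustration.

```php
<?php
// Hypothetical sample markup modeled on the HTML in the question.
$html = <<<'HTML'
<div id="match_info">
  <div class="direct_line">
    <div class="home_impact"><div class='player_name'>David Luiz </div></div>
    <div class="minute">35'</div>
    <div class="away_impact">
    </div>
  </div>
  <div class="direct_line">
    <div class="home_impact"></div>
    <div class="minute">36'</div>
    <div class="away_impact">
      <div class='player_name'>De Gea</div>
    </div>
  </div>
</div>
HTML;

// Collect parser warnings internally instead of silencing them with @.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$events = [];

// Walk each row so home, minute, and away values stay paired.
foreach ($xpath->query("//div[@class='direct_line']") as $line) {
    $events[] = [
        // normalize-space() returns a plain string, evaluated relative to $line.
        'home'   => $xpath->evaluate("normalize-space(div[@class='home_impact'])", $line),
        'minute' => $xpath->evaluate("normalize-space(div[@class='minute'])", $line),
        'away'   => $xpath->evaluate("normalize-space(div[@class='away_impact'])", $line),
    ];
}

foreach ($events as $event) {
    printf("%s | %s | %s\n", $event['home'], $event['minute'], $event['away']);
}
```

Querying per `direct_line` row rather than running two separate document-wide queries keeps each player next to the matching minute, and an empty string marks the side that had no event.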

Tags: php, preg-replace, domdocument

Marx
