I'm trying to capture home_impact and away_impact, but when I extract the text it is full of blank lines, whitespace, line breaks and the like:
David Luiz
35'
36'
De Gea
I've also tried extracting just the div with id match_info, but that yields an array with a single element, which also contains a lot of line breaks. I've tried using preserveWhiteSpace and preg_replace, but neither worked. Any idea how to avoid that? Thanks.
HTML:
<div id="match_info">
<div class="direct_line">
<div class="home_impact"><div class='player_name'>David Luiz </div></div>
<div class="minute">35'</div>
<div class="away_impact">
</div>
</div>
<div class="direct_line">
<div class="home_impact"></div>
<div class="minute">36'</div>
<div class="away_impact">
<div class='player_name'>De Gea</div>
</div>
</div>
<div class="direct_line">
<div class="home_impact"></div>
<div class="minute">38'</div>
<div class="away_impact">
<div class='player_name'>Ashley Cole</div>
</div>
<div class="home_impact"><div class='player_name'>Juan Mata</div></div>
<div class="minute">35'</div>
<div class="away_impact">
</div>
</div>
PHP:
$html = file_get_contents($url);
$doc = new DOMDocument();
//$doc->preserveWhiteSpace = FALSE;
@$doc->loadHTML($html);
$xpath = new DOMXpath ($doc);
$expresionHome="//div[@class='home_impact']";
$expresionAway="//div[@class='away_impact']";
$nodesHome = $xpath->evaluate($expresionHome);
$nodesAway = $xpath->evaluate($expresionAway);
for ($i=0;$i<$nodesHome->length;$i++)
{
echo $nodesHome->item($i)->nodeValue;
echo $nodesAway->item($i)->nodeValue;
}
You can do this with DOMDocument alone, without trimming node content or using regular expressions. Consider the following example, and pay attention to the DOMDocument properties preserveWhiteSpace and formatOutput (the latter only if you want pretty-printed output):
// DOMDocument with unformatted content
$unformatteddocument= new DOMDocument("1.0", "utf-8");
$unformatteddocument->load(PATH_OF_UNFORMATTED_XML);
// Second document: both properties must be set before loading
$document = new DOMDocument("1.0", "utf-8");
$document->preserveWhiteSpace = false; // drop whitespace-only text nodes while parsing
$document->formatOutput = true; // indent the output when saving
$document->loadXML($unformatteddocument->saveXML());
$document->save(PATH_FOR_FORMATTED_XML);
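As a rough illustration of what preserveWhiteSpace changes, the sketch below parses a small inline XML string twice and counts the child nodes of the root element. The sample XML and the variable names are made up for this example and are not taken from the question. The property only has an effect if it is set before the document is loaded.

```php
<?php
// Hypothetical sample document: two elements separated by indentation.
$xml = "<root>\n  <item>a</item>\n  <item>b</item>\n</root>";

// Default behaviour: whitespace-only text nodes between elements are kept.
$keep = new DOMDocument();
$keep->loadXML($xml);

// With preserveWhiteSpace disabled *before* loading, those nodes are dropped.
$strip = new DOMDocument();
$strip->preserveWhiteSpace = false;
$strip->loadXML($xml);

$keptCount = $keep->documentElement->childNodes->length;
$strippedCount = $strip->documentElement->childNodes->length;

echo "children with whitespace preserved: ", $keptCount, "\n";
echo "children with whitespace stripped:  ", $strippedCount, "\n";
```

The stripped document should report fewer child nodes, because only the item elements remain under the root.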
Normalizing whitespace in PHP with UTF-8 encoding, which is how DOMDocument in PHP returns strings:
$normalized = preg_replace(['(\s+)u', '(^\s|\s$)u'], [' ', ''], $text);
This first collapses each run of whitespace into a single space, then trims the space at the beginning or end of the string.
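Applied to the markup from the question, this could look roughly like the sketch below. The helper name normalizeWhitespace and the $players array are hypothetical, the HTML is a shortened copy of the question's sample, and combining both classes in one XPath expression is just one possible choice.

```php
<?php
// Hypothetical helper wrapping the preg_replace call shown above.
function normalizeWhitespace(string $text): string
{
    // Falls back to the input if preg_replace fails (e.g. invalid UTF-8).
    return preg_replace(['(\s+)u', '(^\s|\s$)u'], [' ', ''], $text) ?? $text;
}

// Shortened copy of the markup from the question.
$html = <<<'HTML'
<div id="match_info">
<div class="direct_line">
<div class="home_impact"><div class='player_name'>David Luiz </div></div>
<div class="minute">35'</div>
<div class="away_impact">
</div>
</div>
<div class="direct_line">
<div class="home_impact"></div>
<div class="minute">36'</div>
<div class="away_impact">
<div class='player_name'>De Gea</div>
</div>
</div>
</div>
HTML;

$doc = new DOMDocument();
@$doc->loadHTML($html); // suppress warnings about the incomplete document
$xpath = new DOMXPath($doc);

$players = [];
$nodes = $xpath->query("//div[@class='home_impact' or @class='away_impact']");
foreach ($nodes as $node) {
    $name = normalizeWhitespace($node->nodeValue);
    if ($name !== '') { // skip cells that held only whitespace
        $players[] = $name;
    }
}

print_r($players);
```

Skipping empty strings after normalization removes the blank entries that come from the empty home_impact and away_impact cells.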
Compare with section 2.10, White Space Handling, of the XML standard.