I'm building a command-line PHP scraping app that uses XPath to analyze the HTML. The problem is that every time a new DOMXPath class instance gets created in a loop, I lose memory roughly equal to the size of the XML being loaded. The script runs and runs, slowly building up memory usage until it hits the limit and quits.
I've tried forcing garbage collection with gc_collect_cycles(), but PHP still isn't getting back the memory from old XPath queries. Indeed, the definition of the DOMXPath class doesn't even seem to include a destructor method.
So my question is ... is there any way to force garbage cleanup on DOMXPath after I've already extracted the necessary data? Using unset() on the class instance predictably does nothing.
The code is nothing special, just standard XPath stuff:
//Loaded outside of loop
$this->dom = new DOMDocument();
//Inside Loop
$this->dom->loadHTML($output);
$xpath = new DOMXPath($this->dom);
$nodes = $xpath->query("//span[@class='ckass']");
//unset($this->dom) and unset($xpath) don't seem to have any effect
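For completeness, the cleanup attempt mentioned above looks roughly like this:

//Inside loop, after the data has been extracted; neither call frees the memory
unset($xpath);
gc_collect_cycles();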
As you can see above, I've kept the instantiation of a new DOMDocument class outside of the loop, although that doesn't seem to improve performance. I've even tried taking the $xpath instance out of the loop and loading the DOM into XPath directly via the constructor; the memory loss is the same.
After this answer sat here for years without a conclusion, finally an update! I recently ran into a similar problem, and it turns out that DOMXPath just leaks the memory and you can't control it. I have not searched whether this has been reported on bugs.php.net so far (this could be useful to edit in later).
The "working" solutions I have found to the problem are just workarounds. The basic idea was to replace the DOMNodeList
Traversable
returned by DOMXPath::query()
with a different one containing the same nodes.
The most fitting workaround is DOMXPathElementsIterator, which allows you to query the concrete XPath expression you have in your question without the memory leaks:
$nodes = new DOMXPathElementsIterator($this->dom, "//span[@class='ckass']");
foreach ($nodes as $span) {
    ...
}
This class is now part of the development version of Iterator-Garden, and $nodes is an iterator over all the <span> DOMElements.
The downside of this workaround is that the XPath result is limited to what a SimpleXMLElement::xpath() result can express (this differs from DOMXPath::query()), because it is used internally to prevent the memory leak.
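For illustration, here is a minimal sketch of that internal idea; the xpath_elements() helper is hypothetical, not the actual Iterator-Garden code:

// Hypothetical helper: run the query through SimpleXMLElement::xpath()
// and map the results back to DOMElement objects.
function xpath_elements(DOMDocument $dom, $expression)
{
    $sxe = simplexml_import_dom($dom); // wraps the same underlying document
    $elements = [];
    foreach ($sxe->xpath($expression) as $node) {
        $elements[] = dom_import_simplexml($node); // back to a DOMElement
    }
    return $elements;
}
// Usage: $spans = xpath_elements($this->dom, "//span[@class='ckass']");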
Another alternative is to make use of DOMNodeListIterator over a DOMNodeList, like the one returned by DOMDocument::getElementsByTagName(). However, these iterations are slow.
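For the concrete query from the question, that approach would look roughly like this (here iterating the DOMNodeList directly with foreach and filtering the class attribute by hand, since getElementsByTagName() cannot express the [@class='ckass'] predicate):

// No DOMXPath involved; the attribute check replaces the XPath predicate
foreach ($this->dom->getElementsByTagName('span') as $span) {
    if ($span->getAttribute('class') === 'ckass') {
        // ... process $span as before
    }
}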
Hope this is of some use even though the question is really old. It helped me in a similar situation.
Calling garbage collection cycles only makes sense if the objects aren't referenced (used) any longer.
For example, if you create a new DOMXPath object for the same DOMDocument over and over again (keep in mind it's connected to the DOMDocument, which still exists), that sounds like your memory "leak". You just use more and more memory.
Instead, you can just re-use the existing DOMXPath object the same way you re-use the DOMDocument object all the time. Give it a try:
//Loaded outside of loop
$this->dom = new DOMDocument();
$xpath = new DOMXPath($this->dom);
//Inside Loop
$this->dom->loadHTML($output);
$nodes = $xpath->query("//span[@class='ckass']");
If you are using libxml_use_internal_errors(true), then that is the reason for the memory leak, because the error list keeps growing. Use libxml_clear_errors(), or check this answer for details.
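A minimal sketch of that fix, clearing the error buffer on every iteration:

//Loaded outside of loop
libxml_use_internal_errors(true);

//Inside Loop
$this->dom->loadHTML($output);
libxml_clear_errors(); // drop the errors accumulated by loadHTML()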