
Work-around a StackOverflowException

I'm using HtmlAgilityPack to parse roughly 200,000 HTML documents.

I cannot predict the contents of these documents; however, one of them causes my application to fail with a StackOverflowException. The document contains this HTML:

<ol>
    <li><li><li><li><li><li>...
</ol>

There are roughly 10,000 <li> elements nested like that. Because of the way HtmlAgilityPack parses HTML, this nesting causes a StackOverflowException.
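To reproduce the failure in isolation, a document of the same shape can be generated rather than fetched. This is a minimal sketch: NestedListSample and Build are hypothetical names, not part of HtmlAgilityPack, and the snippet does not establish the exact depth at which the parser overflows.

```csharp
using System;
using System.Text;

public static class NestedListSample
{
    // Builds "<ol>", then `depth` unclosed <li> tags, then "</ol>",
    // mirroring the shape of the problem document in the question.
    public static string Build(int depth)
    {
        if (depth < 0)
        {
            throw new ArgumentOutOfRangeException(nameof(depth));
        }

        var sb = new StringBuilder("<ol>");
        for (int i = 0; i < depth; i++)
        {
            sb.Append("<li>");
        }
        sb.Append("</ol>");
        return sb.ToString();
    }
}
```

Calling NestedListSample.Build(10000) gives a string with roughly the nesting described above, which can then be passed to the parser under test.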

Unfortunately a StackOverflowException is not catchable in .NET 2.0 and later.

I considered giving the thread a larger stack, but that is a hack. It would make my program use much more memory (the program starts about 50 threads for processing HTML, so every one of them would get the larger stack), and the size would need manual adjustment if a similar document ever turned up again.
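For reference, the larger-stack approach relies on the Thread constructor overload that takes a maxStackSize argument. The following is only a sketch of the hack described above, not a recommendation. BigStackRunner, Run and any particular size passed to it are illustrative assumptions, and the stack size a given document would need is unknown.

```csharp
using System;
using System.Threading;

public static class BigStackRunner
{
    // Runs `work` on a dedicated thread created with the requested
    // maximum stack size, waits for it, and returns its result.
    public static T Run<T>(Func<T> work, int maxStackSizeBytes)
    {
        T result = default(T);
        Exception error = null;

        var thread = new Thread(() =>
        {
            try
            {
                result = work();
            }
            catch (Exception ex)
            {
                // Ordinary exceptions are captured and rethrown on the caller's thread.
                // A StackOverflowException cannot be caught here; it still ends the process.
                error = ex;
            }
        }, maxStackSizeBytes);

        thread.Start();
        thread.Join();

        if (error != null)
        {
            throw new InvalidOperationException("Work on the large-stack thread failed.", error);
        }
        return result;
    }
}
```

A call such as BigStackRunner.Run(() => ParseOne(html), 16 * 1024 * 1024), where ParseOne is a hypothetical parsing method, would confine the larger stack to that one thread. The drawbacks named in the question remain: the size is a guess, and a document nested deeply enough will still overflow it.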

Are there any other workarounds I could employ?

Dai asked Oct 01 '12 00:10



1 Answer

I just patched an error that I believe is the same as the one you're describing, and uploaded the patch to the HAP project site:

http://www.codeplex.com/site/users/view/sjdirect (see the patch on 3/8/2012)

The issue and the result are documented in more detail here:

https://code.google.com/p/abot/issues/detail?id=77

The actual fix was to add HtmlDocument.OptionMaxNestedChildNodes, which can be set to prevent StackOverflowExceptions caused by very large numbers of nested tags. When the limit is exceeded, it throws an ApplicationException with the message "Document has more than X nested tags. This is likely due to the page not closing tags properly."

How I'm using HAP after the patch:

using System;
using HtmlAgilityPack;

HtmlDocument hapDoc = new HtmlDocument();
hapDoc.OptionMaxNestedChildNodes = 5000; // This is the option the patch added
string rawContent = GETTHECONTENTHERE; // placeholder: supply the raw HTML string
try
{
    hapDoc.LoadHtml(rawContent);
}
catch (Exception e)
{
    // Instead of a StackOverflowException you should end up here now
    hapDoc.LoadHtml(""); // fall back to an empty document
    _logger.Error(e);
}
sjdirect answered Oct 17 '22 13:10