
Preserving (or restoring) whitespace in TextContent

Tags: c#, anglesharp

Using AngleSharp to process some HTML and extract the text content of an element for later mining, I've run into a problem with the way AngleSharp strips out the HTML tags. For example, I have a piece of HTML something like this (minus the newlines and tabs):

<div id="someID">
    blah, blah, blah, blah
    blah, blah, 
    <ul>
        <li><i>action.</i></li>
        <li><i>Typical, blah, blah, blah</i></li>
    </ul>
    blah, blah, blah
</div>

The problem is that when I get the TextContent:

var content = someDiv.TextContent;

It'll come out like this:

"...blah, blah, action.Typical blah, blah..."

The words action and Typical have been smashed together without any whitespace (because the only thing between them are html tags). This is tripping up my efforts to then tokenize the text content because action.Typical is seen as a single word instead of two words.

I could, of course, run a search and replace (probably with a regex), matching something like (\S)\.(\S) and replacing it with $1. $2. But that would take something like www.somecompany.com and split it into www, somecompany and com, and I might want to preserve the full address (www and com aren't likely to be very useful by themselves anyway). I could exclude words with more than one dot, but a web address might appear as somecompany.com (without the www), or you might encounter an email address like [email protected].
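To make that trade-off concrete, here is a minimal sketch of the search-and-replace idea using System.Text.RegularExpressions. The names NaiveDotSplitter and SplitAfterDots are made up for illustration; the point is to show the side effect on web addresses, not to recommend this approach.

```csharp
using System.Text.RegularExpressions;

public static class NaiveDotSplitter
{
    // Hypothetical helper: inserts a space after any dot that sits
    // directly between two non-whitespace characters.
    public static string SplitAfterDots(string text) =>
        Regex.Replace(text, @"(\S)\.(\S)", "$1. $2");
}

// SplitAfterDots("blah, action.Typical blah") returns "blah, action. Typical blah"
// SplitAfterDots("see www.somecompany.com")   returns "see www. somecompany. com"
```

The first call fixes the smashed-together words, but the second shows the web address being broken apart in the same pass, which is exactly the case the regex cannot tell apart from a sentence boundary.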

Is there a robust way around this, one that preserves at least one space where the tags were stripped out?

Matt Burland asked Nov 01 '25 23:11



1 Answer

It seems the best way to fix this is to recurse down the ChildNodes of the root element (not Children, which only contains elements and so misses text nodes) and then join the resulting pieces of text back together. So, given:

var rootElem = myDoc.GetElementById("someID");

I can create a function like this:

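using System.Collections.Generic;
using AngleSharp.Dom;

// Yields the text of every leaf node under `node`, in document order.
IEnumerable<string> ExtractChildNodes(INode node)
{
    if (node.HasChildNodes)
    {
        // Not a leaf: recurse into each child, including text nodes.
        foreach (var c in node.ChildNodes)
        {
            foreach (var r in ExtractChildNodes(c))
            {
                yield return r;
            }
        }
    }
    else
    {
        // Leaf node: a text node yields its text; an empty element such as <br> yields "".
        yield return node.TextContent;
    }
}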

That tests whether a node has child nodes. If it does, it drills down to the lowest leaf nodes and returns the text from each of them; if it doesn't, it returns the node's own text. I can then do this:

var textContentWithSpacesBetweenNodes = string.Join(" ", ExtractChildNodes(rootElem));

And that should give me:

"...blah, blah, action. Typical blah, blah..."

With the space between action and Typical.

This seems to cope not just with situations like <p>some.</p><p>words</p> but also with self-closing tags like some<br/>words or even some<br>words, which would be a massive pain to deal with using a regex or something similar.
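For anyone who wants to try this end to end, here is a minimal sketch of the same idea as a self-contained class. It differs from the function above in two ways that are my own additions: it only keeps text nodes (so comment nodes don't leak into the output) and it collapses runs of whitespace after joining. The names SpacedText, TextNodes, Extract and ExtractFromHtml are made up for illustration, and the sketch assumes AngleSharp 0.10 or later, where the parser lives in AngleSharp.Html.Parser and exposes ParseDocument.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;
using AngleSharp.Dom;
using AngleSharp.Html.Parser;

public static class SpacedText
{
    // Yields the text of every text node under `node`, in document order.
    // Elements without text (such as <br>) and comments yield nothing.
    public static IEnumerable<string> TextNodes(INode node)
    {
        if (node.NodeType == NodeType.Text)
        {
            yield return node.TextContent;
            yield break;
        }

        foreach (var child in node.ChildNodes)
        {
            foreach (var text in TextNodes(child))
            {
                yield return text;
            }
        }
    }

    // Joins the text nodes with a single space, then collapses any run of
    // whitespace (including source newlines and tabs) into one space.
    public static string Extract(INode root)
    {
        var joined = string.Join(" ", TextNodes(root));
        return Regex.Replace(joined, @"\s+", " ").Trim();
    }

    // Convenience wrapper: parse an HTML string and extract the text of the
    // element with the given id, or an empty string if there is no such element.
    public static string ExtractFromHtml(string html, string id)
    {
        var document = new HtmlParser().ParseDocument(html);
        var root = document.GetElementById(id);
        return root == null ? string.Empty : Extract(root);
    }
}
```

Called as SpacedText.ExtractFromHtml(html, "someID"), this returns the text with a space wherever a tag boundary was. One caveat of joining every node with a space: inline markup in the middle of a word or right before punctuation also gets a space, so wo<b>rd</b> becomes "wo rd" and <i>action</i>. becomes "action .". Whether that matters depends on the tokenizer downstream.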

Matt Burland answered Nov 03 '25 13:11



