Preserving (or restoring) whitespace in TextContent

Question

Using AngleSharp to process some HTML and extract the text content of an element for later mining, I've run into a problem with the way AngleSharp strips out the HTML tags. For example, I have a piece of HTML something like this (minus the newlines and tabs):

<div id="someID">
    blah, blah, blah, blah
    blah, blah, 
    <ul>
        <li><i>action.</i></li>
        <li><i>Typical, blah, blah, blah</i></li>
    </ul>
    blah, blah, blah
</div>

The problem is that when I get the TextContent:

var content = someDiv.TextContent;

It'll come out like this:

"...blah, blah, action.Typical blah, blah..."

The words action and Typical have been smashed together without any whitespace (because the only things between them are HTML tags). This trips up my efforts to tokenize the text content, because action.Typical is seen as a single word instead of two.

I could, of course, just run a search and replace (probably using a regex), something like (\S)\.(\S) and replace it with $1. $2 but then that would take something like www.somecompany.com and split it up into www, somecompany and com and I might want to preserve that (or failing that www and com aren't likely to be very useful anyway by themselves). I could exclude words with more than one dot, but a web address might appear as somecompany.com (without the www) or you might encounter an email address like [email protected].
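To make that trade-off concrete, here is a minimal sketch of the search and replace described above. The class and method names (DotSpacing, AddSpaceAfterDots) are hypothetical, made up for illustration; only Regex.Replace comes from the framework.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotSpacing
{
    // Hypothetical helper: inserts a space after any dot that sits
    // directly between two non-whitespace characters.
    public static string AddSpaceAfterDots(string text) =>
        Regex.Replace(text, @"(\S)\.(\S)", "$1. $2");

    public static void Main()
    {
        // The case I want to fix:
        Console.WriteLine(AddSpaceAfterDots("blah, action.Typical blah"));
        // prints: blah, action. Typical blah

        // The collateral damage on a web address:
        Console.WriteLine(AddSpaceAfterDots("visit www.somecompany.com today"));
        // prints: visit www. somecompany. com today
    }
}
```

The pattern cannot tell a sentence boundary from a dot inside a host name, which is why I would rather fix the spacing at the point where the tags are stripped.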

Is there a robust way around this? To preserve at least one space after the tags have been stripped out?

Matt Burland · Accepted Answer

So it seems like the best way to fix this is to recurse down the ChildNodes (not Children, which skips text nodes) of the root element and then join the pieces back together. So, given:

var rootElem = myDoc.GetElementById("someID");

I can create a function like this:

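IEnumerable<string> ExtractChildNodes(INode node)
{
    if (node.HasChildNodes)
    {
        // Not a leaf: recurse into every child node, text nodes included.
        foreach (var c in node.ChildNodes)
        {
            foreach (var r in ExtractChildNodes(c))
            {
                yield return r;
            }
        }
    }
    else
    {
        // Leaf node: return its text (an empty string for an empty element).
        yield return node.TextContent;
    }
}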

That tests whether a node has child nodes and, if it does, drills down to the lowest leaf nodes and returns the text from there. I can then do this:

var textContentWithSpacesBetweenNodes = string.Join(" ", ExtractChildNodes(rootElem));

And that should give me:

"...blah, blah, action. Typical blah, blah..."

With the space between action and Typical.
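For anyone who wants to try this end to end, here is a minimal sketch that wires the function above into a small console program. It assumes AngleSharp 0.10 or later (where HtmlParser lives in AngleSharp.Html.Parser and exposes ParseDocument); the class name TextContentDemo, the JoinLeafText helper and the sample markup are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using AngleSharp.Dom;
using AngleSharp.Html.Parser;

public static class TextContentDemo
{
    // Same traversal as above: descend to the leaf nodes and yield their text.
    public static IEnumerable<string> ExtractChildNodes(INode node)
    {
        if (node.HasChildNodes)
        {
            foreach (var c in node.ChildNodes)
            {
                foreach (var r in ExtractChildNodes(c))
                {
                    yield return r;
                }
            }
        }
        else
        {
            yield return node.TextContent;
        }
    }

    // Hypothetical convenience wrapper: parse, look up the element, join.
    public static string JoinLeafText(string html, string id)
    {
        var doc = new HtmlParser().ParseDocument(html);
        var root = doc.GetElementById(id)
                   ?? throw new ArgumentException("No element with id " + id);
        return string.Join(" ", ExtractChildNodes(root));
    }

    public static void Main()
    {
        // Sample markup without newlines or tabs, as in the question.
        const string html =
            "<div id=\"someID\">blah, blah, <ul><li><i>action.</i></li>" +
            "<li><i>Typical, blah</i></li></ul> blah, blah</div>";

        var doc = new HtmlParser().ParseDocument(html);
        var root = doc.GetElementById("someID");

        Console.WriteLine(root.TextContent);              // contains "action.Typical"
        Console.WriteLine(JoinLeafText(html, "someID"));  // contains "action. Typical"
    }
}
```

Note that the join adds a separator between every pair of leaf nodes, so text that already ends or starts with whitespace ends up with a doubled space. For tokenizing that is usually harmless.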

This seems to cope not just with words that are separated only by element tags, as in action.Typical above, but also with self-closing tags sitting between two words, which would be a massive pain to deal with using a regex or something similar.
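One possible refinement, offered as a sketch rather than as part of the approach above: yield only text nodes, collapse runs of whitespace, and skip fragments that end up empty. As far as I can tell, a leaf that is a comment node reports the comment text through TextContent, and an empty element such as a line break yields an empty string, so filtering on NodeType.Text keeps both out of the result. The names TextNodeJoiner, ExtractTextNodes and JoinText are hypothetical, and the code again assumes AngleSharp 0.10 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using AngleSharp.Dom;
using AngleSharp.Html.Parser;

public static class TextNodeJoiner
{
    // Yields the normalized text of every text node below 'node',
    // ignoring comments and whitespace-only fragments.
    public static IEnumerable<string> ExtractTextNodes(INode node)
    {
        foreach (var child in node.ChildNodes)
        {
            if (child.NodeType == NodeType.Text)
            {
                // Collapse internal whitespace runs and trim the ends.
                var text = Regex.Replace(child.TextContent, @"\s+", " ").Trim();
                if (text.Length > 0)
                {
                    yield return text;
                }
            }
            else if (child.NodeType == NodeType.Element)
            {
                foreach (var t in ExtractTextNodes(child))
                {
                    yield return t;
                }
            }
        }
    }

    // Hypothetical convenience wrapper: parse, look up the element, join.
    public static string JoinText(string html, string id)
    {
        var doc = new HtmlParser().ParseDocument(html);
        var root = doc.GetElementById(id)
                   ?? throw new ArgumentException("No element with id " + id);
        return string.Join(" ", ExtractTextNodes(root));
    }

    public static void Main()
    {
        const string html =
            "<div id=\"d\">some<br/>words<!-- hidden --> <span> and   more </span></div>";
        Console.WriteLine(JoinText(html, "d")); // prints: some words and more
    }
}
```

The trade-off is the same as with the plain join: a space is inserted at every node boundary, so a word split across inline elements (for example a bolded first letter) comes out as two tokens, and punctuation that follows an element gets a space in front of it.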

Tags: c#, anglesharp