Truncating HTML content at the end of text blocks (block elements)

When we shorten (truncate) textual content, we usually just cut it at a specific character index. That is already complicated with HTML, but I want to truncate my HTML content (generated using a content-editable div) by different rules:

1. I define a character index N that serves as the truncation limit.
2. The algorithm checks whether the content is at least N characters long (text only, not counting tags); if it is not, it returns the whole content.
3. It then examines the text positions from N-X to N+X and looks for ends of block nodes; X is a predefined offset, likely about N/5 to N/4.
4. If several block nodes end within this range, the algorithm selects the one that ends closest to the limit index N.
5. If no block node ends within this range, it finds the word boundary within the same range that is closest to N and truncates at that position.
6. It returns the truncated content as valid HTML (all tags closed at the end).
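
The selection rule described in the list above can be sketched on plain text alone. This is only an illustration under assumptions: the class name TruncationPoint, the method Choose, and the idea of receiving block-end offsets as a ready-made list are hypothetical, and a "word boundary" is taken to mean the end of a run of non-whitespace characters.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TruncationPoint
{
    // text:      the content with all tags stripped.
    // blockEnds: text offsets at which an eligible block element ends
    //            (hypothetical input, produced by whatever walks the HTML).
    // n, x:      the limit index N and the offset X from the question.
    // Returns the text offset to cut at, or text.Length if nothing is cut.
    public static int Choose(string text, IEnumerable<int> blockEnds, int n, int x)
    {
        // Content shorter than N is returned whole.
        if (text.Length < n)
            return text.Length;

        int lo = Math.Max(0, n - x);
        int hi = Math.Min(text.Length, n + x);

        // Block ends inside [N-X, N+X]: the one closest to N wins.
        var blocks = blockEnds.Where(e => e >= lo && e <= hi).ToList();
        if (blocks.Count > 0)
            return blocks.OrderBy(e => Math.Abs(e - n)).First();

        // Otherwise fall back to the word boundary closest to N.
        var words = Enumerable.Range(lo, hi - lo + 1)
                              .Where(i => IsWordEnd(text, i))
                              .ToList();
        if (words.Count > 0)
            return words.OrderBy(i => Math.Abs(i - n)).First();

        // No boundary of either kind in the window: cut at N itself (assumption).
        return n;
    }

    // True when a word ends exactly before offset i.
    private static bool IsWordEnd(string text, int i)
    {
        if (i <= 0 || char.IsWhiteSpace(text[i - 1]))
            return false;
        return i == text.Length || char.IsWhiteSpace(text[i]);
    }
}
```

For example, Choose("aaaa bbbb cccc dddd", new[] { 8, 13 }, 10, 3) picks 8, because 8 is nearer to 10 than 13 is. Producing valid HTML from the chosen offset is a separate step that this sketch does not cover.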

My content-editable generated content may consist of paragraphs (with line breaks), preformatted code blocks, block quotes, ordered and unordered lists, headers, bold and italic text (inline nodes, which shouldn't count in the truncation process) etc. The final implementation will of course define which elements specifically are possible truncation candidates. Headers, even though they are block HTML elements, will not count as truncation points because we don't want widowed headers. Paragraphs, individual list items, whole ordered and unordered lists, block quotes, preformatted blocks, void elements etc. are good candidates. Headers and all inline elements aren't.
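
One way to pin down "which elements are candidates" is a plain allow-list of tag names. The sketch below is an assumption, not a definitive list: the class name TruncationCandidates is hypothetical, and the set contains only the element kinds named in the paragraph above (with hr standing in for "void elements").

```csharp
using System;
using System.Collections.Generic;

public static class TruncationCandidates
{
    // Elements after which truncation is allowed. Headers (h1-h6) and inline
    // elements (b, i, em, strong, code, ...) are left out on purpose.
    private static readonly HashSet<string> Allowed =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "p", "li", "ol", "ul", "blockquote", "pre", "hr"
        };

    public static bool IsCandidate(string tagName)
    {
        return tagName != null && Allowed.Contains(tagName);
    }
}
```

Keeping the set in one place makes it easy to adjust once the editor's real output is known.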

Example

Let's take this very Stack Overflow question as an example of HTML content that I would like to truncate. Let's set the truncation limit to 1000 characters with an offset of 250 characters (1/4).

In this example, the truncation boundary between two block nodes that lies closest to character 1000 is the one between </OL> and P (My content-editable generated...). This means that my HTML should be truncated right between these two tags. The result would be a little less than 1000 characters of text, but the truncated content would stay meaningful because it would not be cut somewhere in the middle of a text passage.

I hope this explains how the algorithm should work.

The problem

The first problem I see is that I'm dealing with a nested structure (HTML). I also have to distinguish between element types (only block elements count, inline ones don't). And last but not least, I have to count only the text characters in my string and ignore those that belong to tags.
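
All three concerns (nesting, element type, text-only counting) can be handled in a single recursive walk that accumulates text and records the text offset at which each block element ends. The sketch below makes a simplifying assumption: it parses the fragment with System.Xml.Linq, which only works for well-formed XML; for real content-editable output an HTML parser such as HTML Agility Pack would take its place. The class name BlockEndScanner is hypothetical.

```csharp
using System.Collections.Generic;
using System.Text;
using System.Xml.Linq;

public static class BlockEndScanner
{
    // Appends the text content of root to text (tags are never counted) and
    // returns the text offsets at which elements named in blockTags end.
    public static List<int> Scan(XElement root, ISet<string> blockTags, StringBuilder text)
    {
        var ends = new List<int>();
        Walk(root, blockTags, text, ends);
        return ends;
    }

    private static void Walk(XElement element, ISet<string> blockTags,
                             StringBuilder text, List<int> ends)
    {
        foreach (XNode node in element.Nodes())
        {
            var textNode = node as XText;
            if (textNode != null)
            {
                // Only text nodes advance the character count.
                text.Append(textNode.Value);
                continue;
            }

            var child = node as XElement;
            if (child != null)
            {
                // Recurse first so the offset is taken after the child's text.
                Walk(child, blockTags, text, ends);
                if (blockTags.Contains(child.Name.LocalName))
                    ends.Add(text.Length);
            }
        }
    }
}
```

A nested list reports the same offset twice (once for the last li, once for the enclosing list), which is harmless when the offsets are only used to pick the one closest to N.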

Possible solutions

1. I could parse my content manually by creating an object tree that represents the content nodes and their hierarchy.
2. I could convert the HTML to something easier to manage, like Markdown, then simply search for the new line closest to my index N and convert back to HTML.
3. I could use something like HTML Agility Pack (HAP) in place of the manual parsing in #1 and then somehow use XPath to extract block nodes and truncate the content.

Second thoughts

1. I'm sure I could make #1 work, but it feels like reinventing the wheel.
2. I don't think there's any C# library for #2, so I would have to convert HTML to Markdown manually as well, or run e.g. pandoc as an external process.
3. I could use HAP as it's great at manipulating HTML, but I'm not sure whether my truncation would be simple enough with it. I'm afraid the bulk of the processing would still sit outside HAP in my custom code.

How should one approach such a truncation algorithm? My head just seems to be too tired to come to a conclusion (or solution).

asked Jun 18 '15 22:06

Robert Koritnik

1 Answer

Here is some sample code that truncates a node by the length of its inner text. It relies on the recursive nature of the InnerText property and on the CloneNode method.

    public static HtmlNode TruncateInnerText(HtmlNode node, int length)
    {
        if (node == null)
            throw new ArgumentNullException(nameof(node));

        // nothing to do? (the text already fits within the limit)
        if (node.InnerText.Length <= length)
            return node;

        // shallow clone of the root, then copy children into it until the limit is reached
        HtmlNode clone = node.CloneNode(false);
        TruncateInnerText(node, clone, clone, length);
        return clone;
    }

    // Copies children of source into current until root holds length characters of text.
    // Returns true once the limit is reached, so that callers stop copying further siblings
    // (otherwise empty element shells could be appended after the cut).
    private static bool TruncateInnerText(HtmlNode source, HtmlNode root, HtmlNode current, int length)
    {
        foreach (HtmlNode child in source.ChildNodes)
        {
            // limit already reached? then we are done
            int used = root.InnerText.Length;
            if (used >= length)
                return true;

            // does the whole child still fit?
            if (used + child.InnerText.Length <= length)
            {
                // yes, just clone the whole hierarchy
                current.ChildNodes.Add(child.CloneNode(true));
                continue;
            }

            // is it a text node? then crop it to the characters still available
            HtmlTextNode text = child as HtmlTextNode;
            if (text != null)
            {
                int keep = length - used;
                current.ChildNodes.Add(root.OwnerDocument.CreateTextNode(text.InnerText.Substring(0, keep)));
                return true;
            }

            // it's not a text node, shallow clone and dive in
            HtmlNode childClone = child.CloneNode(false);
            current.ChildNodes.Add(childClone);
            if (TruncateInnerText(child, root, childClone, length))
                return true;
        }

        return false;
    }

And here is a sample C# console app that scrapes this question as an example and truncates it to 500 characters.

    using System;
    using HtmlAgilityPack;

    class Program
    {
        static void Main(string[] args)
        {
            var web = new HtmlWeb();
            var doc = web.Load("http://stackoverflow.com/questions/30926684/truncating-html-content-at-the-end-of-text-blocks-block-elements");
            // SelectSingleNode returns null if the page markup no longer matches this XPath
            var post = doc.DocumentNode.SelectSingleNode("//td[@class='postcell']//div[@class='post-text']");
            var truncated = TruncateInnerText(post, 500);
            Console.WriteLine(truncated.OuterHtml);
            Console.WriteLine("Size: " + truncated.InnerText.Length);
        }

        // the two TruncateInnerText methods shown above go here
    }

When run, it should display this:

    <div class="post-text" itemprop="text">

    <p>Mainly when we shorten/truncate textual content we usually just truncate it at specific character index. That's already complicated in HTML anyway, but I want to truncate my HTML content (generated using content-editable <code>div</code>) using different measures:</p>

    <ol>
    <li>I would define character index <code>N</code> that will serve as truncating startpoint <em>limit</em></li>
    <li>Algorithm will check whether content is at least <code>N</code> characters long (text only; not counting tags); if it's not it will just return the whole content</li>
    <li>It would then</li></ol></div>
    Size: 500

Note: I have not truncated at a word boundary, just at a character boundary, and no, it does not follow the suggestions in my comment at all :-)
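
For the word-boundary part that the code above leaves out, the cropped text could be backed up to the last complete word before it is turned into a text node. The following is only a sketch under assumptions: the helper WordCrop.Crop is hypothetical, it works on a plain string, and it treats any whitespace character as a word separator.

```csharp
using System;

public static class WordCrop
{
    // Returns at most maxLength characters of text, cut after the last
    // complete word. Falls back to a hard cut when there is no whitespace.
    public static string Crop(string text, int maxLength)
    {
        if (text == null)
            throw new ArgumentNullException(nameof(text));
        if (maxLength < 0)
            throw new ArgumentOutOfRangeException(nameof(maxLength));
        if (text.Length <= maxLength)
            return text;

        // Walk back from the limit until a whitespace character is found.
        int cut = maxLength;
        while (cut > 0 && !char.IsWhiteSpace(text[cut]))
            cut--;

        // No whitespace before the limit: cut in the middle of the word.
        if (cut == 0)
            return text.Substring(0, maxLength);

        return text.Substring(0, cut).TrimEnd();
    }
}
```

With this helper, Crop("It would then check", 10) yields "It would" instead of "It would t". The resulting text can be shorter than the limit, so the reported size would no longer be exactly 500.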

answered Oct 31 '22 23:10

Simon Mourier

Tags: html, c#, extract, truncate