Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

c# Truncate HTML safely for article summary

Tags:

html

c#

regex

Does anyone have a c# variation of this?

This is so I can take some html and display it without breaking as a summary lead in to an article?

Truncate text containing HTML, ignoring tags

Save me from reinventing the wheel!

Edit

Sorry, new here, and your right, should have phrased the question better, heres a bit more info

I wish to take a html string and truncate it to a set number of words (or even char length) so I can then show the start of it as a summary (which then leads to the main article). I wish to preserve the html so I can show the links etc in preview.

The main issue I have to solve is the fact that we may well end up with unclosed html tags if we truncate in the middle of 1 or more tags!

The idea I have for solution is to

  1. truncate the html to N words (words better but chars ok) first (be sure not to stop in the middle of a tag and truncate a require attribute)

  2. work through the opened html tags in this truncated string (maybe stick them on stack as I go?)

  3. then work through the closing tags and ensure they match the ones on stack as I pop them off?

  4. if any open tags left on stack after this, then write them to end of truncated string and html should be good to go!!!!

Edit 12/11/2009

  • Here is what I have bumbled together so far as a unittest file in VS2008, this 'may' help someone in future
  • My hack attempts based on Jan code are at top for char version + word version (DISCLAIMER: this is dirty rough code!! on my part)
  • I assume working with 'well-formed' HTML in all cases (but not necessarily a full document with a root node as per XML version)
  • Abels XML version is at bottom, but not yet got round to fully getting tests to run on this yet (plus need to understand the code) ...
  • I will update when I get chance to refine
  • having trouble with posting code? is there no upload facility on stack?

Thanks for all comments :)

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Xml;
using System.Xml.XPath;
using Microsoft.VisualStudio.TestTools.UnitTesting;

namespace PINET40TestProject
{
    [TestClass]
    public class UtilityUnitTest
    {
        public static string TruncateHTMLSafeishChar(string text, int charCount)
        {
            bool inTag = false;
            int cntr = 0;
            int cntrContent = 0;

            // loop through html, counting only viewable content
            foreach (Char c in text)
            {
                if (cntrContent == charCount) break;
                cntr++;
                if (c == '<')
                {
                    inTag = true;
                    continue;
                }

                if (c == '>')
                {
                    inTag = false;
                    continue;
                }
                if (!inTag) cntrContent++;
            }

            string substr = text.Substring(0, cntr);

            //search for nonclosed tags        
            MatchCollection openedTags = new Regex("<[^/](.|\n)*?>").Matches(substr);
            MatchCollection closedTags = new Regex("<[/](.|\n)*?>").Matches(substr);

            // create stack          
            Stack<string> opentagsStack = new Stack<string>();
            Stack<string> closedtagsStack = new Stack<string>();

            // to be honest, this seemed like a good idea then I got lost along the way 
            // so logic is probably hanging by a thread!! 
            foreach (Match tag in openedTags)
            {
                string openedtag = tag.Value.Substring(1, tag.Value.Length - 2);
                // strip any attributes, sure we can use regex for this!
                if (openedtag.IndexOf(" ") >= 0)
                {
                    openedtag = openedtag.Substring(0, openedtag.IndexOf(" "));
                }

                // ignore brs as self-closed
                if (openedtag.Trim() != "br")
                {
                    opentagsStack.Push(openedtag);
                }
            }

            foreach (Match tag in closedTags)
            {
                string closedtag = tag.Value.Substring(2, tag.Value.Length - 3);
                closedtagsStack.Push(closedtag);
            }

            if (closedtagsStack.Count < opentagsStack.Count)
            {
                while (opentagsStack.Count > 0)
                {
                    string tagstr = opentagsStack.Pop();

                    if (closedtagsStack.Count == 0 || tagstr != closedtagsStack.Peek())
                    {
                        substr += "</" + tagstr + ">";
                    }
                    else
                    {
                        closedtagsStack.Pop();
                    }
                }
            }

            return substr;
        }

        public static string TruncateHTMLSafeishWord(string text, int wordCount)
        {
            bool inTag = false;
            int cntr = 0;
            int cntrWords = 0;
            Char lastc = ' ';

            // loop through html, counting only viewable content
            foreach (Char c in text)
            {
                if (cntrWords == wordCount) break;
                cntr++;
                if (c == '<')
                {
                    inTag = true;
                    continue;
                }

                if (c == '>')
                {
                    inTag = false;
                    continue;
                }
                if (!inTag)
                {
                    // do not count double spaces, and a space not in a tag counts as a word
                    if (c == 32 && lastc != 32)
                        cntrWords++;
                }
            }

            string substr = text.Substring(0, cntr) + " ...";

            //search for nonclosed tags        
            MatchCollection openedTags = new Regex("<[^/](.|\n)*?>").Matches(substr);
            MatchCollection closedTags = new Regex("<[/](.|\n)*?>").Matches(substr);

            // create stack          
            Stack<string> opentagsStack = new Stack<string>();
            Stack<string> closedtagsStack = new Stack<string>();

            foreach (Match tag in openedTags)
            {
                string openedtag = tag.Value.Substring(1, tag.Value.Length - 2);
                // strip any attributes, sure we can use regex for this!
                if (openedtag.IndexOf(" ") >= 0)
                {
                    openedtag = openedtag.Substring(0, openedtag.IndexOf(" "));
                }

                // ignore brs as self-closed
                if (openedtag.Trim() != "br")
                {
                    opentagsStack.Push(openedtag);
                }
            }

            foreach (Match tag in closedTags)
            {
                string closedtag = tag.Value.Substring(2, tag.Value.Length - 3);
                closedtagsStack.Push(closedtag);
            }

            if (closedtagsStack.Count < opentagsStack.Count)
            {
                while (opentagsStack.Count > 0)
                {
                    string tagstr = opentagsStack.Pop();

                    if (closedtagsStack.Count == 0 || tagstr != closedtagsStack.Peek())
                    {
                        substr += "</" + tagstr + ">";
                    }
                    else
                    {
                        closedtagsStack.Pop();
                    }
                }
            }

            return substr;
        }

        public static string TruncateHTMLSafeishCharXML(string text, int charCount)
        {
            // your data, probably comes from somewhere, or as params to a methodint 
            XmlDocument xml = new XmlDocument();
            xml.LoadXml(text);
            // create a navigator, this is our primary tool
            XPathNavigator navigator = xml.CreateNavigator();
            XPathNavigator breakPoint = null;

            // find the text node we need:
            while (navigator.MoveToFollowing(XPathNodeType.Text))
            {
                string lastText = navigator.Value.Substring(0, Math.Min(charCount, navigator.Value.Length));
                charCount -= navigator.Value.Length;
                if (charCount <= 0)
                {
                    // truncate the last text. Here goes your "search word boundary" code:        
                    navigator.SetValue(lastText);
                    breakPoint = navigator.Clone();
                    break;
                }
            }

            // first remove text nodes, because Microsoft unfortunately merges them without asking
            while (navigator.MoveToFollowing(XPathNodeType.Text))
            {
                if (navigator.ComparePosition(breakPoint) == XmlNodeOrder.After)
                {
                    navigator.DeleteSelf();
                }
            }

            // moves to parent, then move the rest
            navigator.MoveTo(breakPoint);
            while (navigator.MoveToFollowing(XPathNodeType.Element))
            {
                if (navigator.ComparePosition(breakPoint) == XmlNodeOrder.After)
                {
                    navigator.DeleteSelf();
                }
            }

            // moves to parent
            // then remove *all* empty nodes to clean up (not necessary):
            // TODO, add empty elements like <br />, <img /> as exclusion
            navigator.MoveToRoot();
            while (navigator.MoveToFollowing(XPathNodeType.Element))
            {
                while (!navigator.HasChildren && (navigator.Value ?? "").Trim() == "")
                {
                    navigator.DeleteSelf();
                }
            }

            // moves to parent
            navigator.MoveToRoot();
            return navigator.InnerXml;
        }

        [TestMethod]
        public void TestTruncateHTMLSafeish()
        {
            // Case where we just make it to start of HREF (so effectively an empty link)

            // 'simple' nested none attributed tags
            Assert.AreEqual(@"<h1>1234</h1><b><i>56789</i>012</b>",
            TruncateHTMLSafeishChar(
                @"<h1>1234</h1><b><i>56789</i>012345</b>",
                12));

            // In middle of a!
            Assert.AreEqual(@"<h1>1234</h1><a href=""testurl""><b>567</b></a>",
            TruncateHTMLSafeishChar(
                @"<h1>1234</h1><a href=""testurl""><b>5678</b></a><i><strong>some italic nested in string</strong></i>",
                7));

            // more
            Assert.AreEqual(@"<div><b><i><strong>1</strong></i></b></div>",
            TruncateHTMLSafeishChar(
                @"<div><b><i><strong>12</strong></i></b></div>",
                1));

            // br
            Assert.AreEqual(@"<h1>1 3 5</h1><br />6",
            TruncateHTMLSafeishChar(
                @"<h1>1 3 5</h1><br />678<br />",
                6));
        }

        [TestMethod]
        public void TestTruncateHTMLSafeishWord()
        {
            // zero case
            Assert.AreEqual(@" ...",
                            TruncateHTMLSafeishWord(
                                @"",
                               5));

            // 'simple' nested none attributed tags
            Assert.AreEqual(@"<h1>one two <br /></h1><b><i>three  ...</i></b>",
            TruncateHTMLSafeishWord(
                @"<h1>one two <br /></h1><b><i>three </i>four</b>",
                3), "we have added ' ...' to end of summary");

            // In middle of a!
            Assert.AreEqual(@"<h1>one two three </h1><a href=""testurl""><b class=""mrclass"">four  ...</b></a>",
            TruncateHTMLSafeishWord(
                @"<h1>one two three </h1><a href=""testurl""><b class=""mrclass"">four five </b></a><i><strong>some italic nested in string</strong></i>",
                4));

            // start of h1
            Assert.AreEqual(@"<h1>one two three  ...</h1>",
            TruncateHTMLSafeishWord(
                @"<h1>one two three </h1><a href=""testurl""><b>four five </b></a><i><strong>some italic nested in string</strong></i>",
                3));

            // more than words available
            Assert.AreEqual(@"<h1>one two three </h1><a href=""testurl""><b>four five </b></a><i><strong>some italic nested in string</strong></i> ...",
            TruncateHTMLSafeishWord(
                @"<h1>one two three </h1><a href=""testurl""><b>four five </b></a><i><strong>some italic nested in string</strong></i>",
                99));
        }

        [TestMethod]
        public void TestTruncateHTMLSafeishWordXML()
        {
            // zero case
            Assert.AreEqual(@" ...",
                            TruncateHTMLSafeishWord(
                                @"",
                               5));

            // 'simple' nested none attributed tags
            string output = TruncateHTMLSafeishCharXML(
                @"<body><h1>one two </h1><b><i>three </i>four</b></body>",
                13);
            Assert.AreEqual(@"<body>\r\n  <h1>one two </h1>\r\n  <b>\r\n    <i>three</i>\r\n  </b>\r\n</body>", output,
             "XML version, no ... yet and addeds '\r\n  + spaces?' to format document");

            // In middle of a!
            Assert.AreEqual(@"<h1>one two three </h1><a href=""testurl""><b class=""mrclass"">four  ...</b></a>",
            TruncateHTMLSafeishCharXML(
                @"<body><h1>one two three </h1><a href=""testurl""><b class=""mrclass"">four five </b></a><i><strong>some italic nested in string</strong></i></body>",
                4));

            // start of h1
            Assert.AreEqual(@"<h1>one two three  ...</h1>",
            TruncateHTMLSafeishCharXML(
                @"<h1>one two three </h1><a href=""testurl""><b>four five </b></a><i><strong>some italic nested in string</strong></i>",
                3));

            // more than words available
            Assert.AreEqual(@"<h1>one two three </h1><a href=""testurl""><b>four five </b></a><i><strong>some italic nested in string</strong></i> ...",
            TruncateHTMLSafeishCharXML(
                @"<h1>one two three </h1><a href=""testurl""><b>four five </b></a><i><strong>some italic nested in string</strong></i>",
                99));
        }
    }
}
like image 733
WickedW Avatar asked Nov 11 '09 12:11

WickedW


People also ask

What C is used for?

C programming language is a machine-independent programming language that is mainly used to create many types of applications and operating systems such as Windows, and other complicated programs such as the Oracle database, Git, Python interpreter, and games and is considered a programming foundation in the process of ...

What is the full name of C?

In the real sense it has no meaning or full form. It was developed by Dennis Ritchie and Ken Thompson at AT&T bell Lab. First, they used to call it as B language then later they made some improvement into it and renamed it as C and its superscript as C++ which was invented by Dr.

Is C language easy?

C is a general-purpose language that most programmers learn before moving on to more complex languages. From Unix and Windows to Tic Tac Toe and Photoshop, several of the most commonly used applications today have been built on C. It is easy to learn because: A simple syntax with only 32 keywords.

What is C language basics?

What is C? C is a general-purpose programming language created by Dennis Ritchie at the Bell Laboratories in 1972. It is a very popular language, despite being old. C is strongly associated with UNIX, as it was developed to write the UNIX operating system.


3 Answers

EDIT: See below for a full solution, this first attempt strips the HTML, the second does not

Let's summarize what you want:

  • No HTML in the result
  • It should take any valid data inside <body>
  • It has a fixed maximum length

If you HTML is XHTML this becomes trivial (and, while I haven't seen the PHP solution, I doubt very much they use a similar approach, but I believe this is understandable and rather easy):

XmlDocument xml = new XmlDocument();

// replace the following line with the content of your full XHTML
xml.LoadXml(@"<body><p>some <i>text</i>here</p><div>that needs stripping</div></body>");

// Get all textnodes under <body> (twice "//" is on purpose)
XmlNodeList nodes = xml.SelectNodes("//body//text()");

// loop through the text nodes, replace this with whatever you like to do with the text
foreach (var node in nodes)
{
    Debug.WriteLine(((XmlCharacterData)node).Value);
}

Note: spaces etc will be preserved. This is usually a good thing.

If you don't have XHTML, you can use the HTML Agility Pack, which let's you do about the same for plain old HTML (it internally converts it to some DOM). I haven't tried it, but it should run rather smooth.


BIG EDIT:

Actual solution

In a little comment I promised to take the XHTML / XmlDocument approach and use that for a typesafe method for splitting your HTML based on text length, but keeping HTML code. I took the following HTML, the code breaks it correctly in the middle of needs, removes the rest, removes empty nodes and automatically closes any open elements.

The sample HTML:

<body>
    <p><tt>some<u><i>text</i>here</u></tt></p>
    <div>that <b><i>needs <span>str</span>ip</i></b><s>ping</s></div>
</body>

The code, tested and working with any kind of input (ok, granted, I just did some tests and code may contain bugs, let me know if you find them!).

// your data, probably comes from somewhere, or as params to a method
int lengthAvailable = 20;
XmlDocument xml = new XmlDocument();
xml.LoadXml(@"place-html-code-here-left-out-for-brevity");

// create a navigator, this is our primary tool
XPathNavigator navigator = xml.CreateNavigator();
XPathNavigator breakPoint = null;


string lastText = "";

// find the text node we need:
while (navigator.MoveToFollowing(XPathNodeType.Text))
{
    lastText = navigator.Value.Substring(0, Math.Min(lengthAvailable, navigator.Value.Length));
    lengthAvailable -= navigator.Value.Length;

    if (lengthAvailable <= 0)
    {
        // truncate the last text. Here goes your "search word boundary" code:
        navigator.SetValue(lastText);
        breakPoint = navigator.Clone();
        break;
    }
}

// first remove text nodes, because Microsoft unfortunately merges them without asking
while (navigator.MoveToFollowing(XPathNodeType.Text))
    if (navigator.ComparePosition(breakPoint) == XmlNodeOrder.After)
        navigator.DeleteSelf();   // moves to parent

// then move the rest
navigator.MoveTo(breakPoint);
while (navigator.MoveToFollowing(XPathNodeType.Element))
    if (navigator.ComparePosition(breakPoint) == XmlNodeOrder.After)
        navigator.DeleteSelf();   // moves to parent

// then remove *all* empty nodes to clean up (not necessary): 
// TODO, add empty elements like <br />, <img /> as exclusion
navigator.MoveToRoot();
while (navigator.MoveToFollowing(XPathNodeType.Element))
    while (!navigator.HasChildren && (navigator.Value ?? "").Trim() == "")
        navigator.DeleteSelf();  // moves to parent

navigator.MoveToRoot();
Debug.WriteLine(navigator.InnerXml);

How the code works

The code does the following things, in that order:

  1. It goes through all text nodes, until the text size expands beyond the allowed limit, in which case it truncates that node. This automatically deals correctly with &gt; etc as one character.
  2. It then shortens the text of the "breaking node" and resets it. It clones the XPathNavigator at this point as we need to remember this "breaking point".
  3. To workaround an MS bug (an ancient one, actually), we have to remove any remaining text nodes first, that follow the breaking point, otherwise we risk auto-merging of text nodes when they end up as siblings of each other. Note: DeleteSelf is handy, but moves the navigator position to its parent, which is why we need to check the current position against the "breaking point" position remembered in the previous step.
  4. Then we do what we wanted to do in the first place: remove any node following the breaking point.
  5. Not a necessary step: cleaning up the code and removing any empty elements. This action is merely to clean up the HTML and/or to filter for specific (dis)allowed elements. It can be left out.
  6. Go back to "root" and get the content as a string with InnerXml.

That's all, rather simple, though it may look a bit daunting at first sight.

PS: the same would be way easier to read and understand were you to use XSLT, which is the ideal tool for this type of jobs.

Update: added extended code sample, based on edited question
Update: added a bit of explanation

like image 102
Abel Avatar answered Oct 08 '22 23:10

Abel


If you want to maintain the html tags you can use this gist which I have recently published. https://gist.github.com/2413598

It uses XmlReader/XmlWriter. It is not production ready, i.e. you'd probably want SgmlReader or HtmlAgilityPack AND you'd want try-catches and choose some fallback...

like image 44
Jaap Avatar answered Oct 08 '22 23:10

Jaap


Ok. This should work (dirty code alert):

        string blah = "hoi <strong>dit <em>is test bla meer tekst</em></strong>";
        int aantalChars = 10;


        bool inTag = false;
        int cntr = 0;
        int cntrContent = 0;
        foreach (Char c in blah)
        {
            if (cntrContent == aantalChars) break;



            cntr++;
            if (c == '<')
            {
                inTag = true;
                continue;
            }
            else if (c == '>')
            {
                inTag = false;
                continue;
            }

            if (!inTag) cntrContent++;
        }

        string substr = blah.Substring(0, cntr);

        //search for nonclosed tags
        MatchCollection openedTags = new Regex("<[^/](.|\n)*?>").Matches(substr);
        MatchCollection closedTags = new Regex("<[/](.|\n)*?>").Matches(substr);

        for (int i =openedTags.Count - closedTags.Count; i >= 1; i--)
        {
            string closingTag = "</" + openedTags[closedTags.Count + i - 1].Value.Substring(1);
            substr += closingTag;
        }
like image 2
Jan Jongboom Avatar answered Oct 09 '22 00:10

Jan Jongboom