Usually, when we shorten (truncate) textual content, we just cut it at a specific character index. That's already complicated in HTML, but I want to truncate my HTML content (generated using a content-editable div) using different measures:

1. I would define character index N that will serve as the truncation startpoint limit
2. Algorithm will check whether content is at least N characters long (text only; not counting tags); if it's not, it will just return the whole content
3. It would then check from N-X to N+X character position (text only) and search for ends of block nodes; X is a predefined offset value, likely about N/5 to N/4;
4. If block node ends are found within that range, it would pick the one closest to N
5. Otherwise it would […] N and truncate at that position.

My content-editable generated content may consist of paragraphs (with line breaks), preformatted code blocks, block quotes, ordered and unordered lists, headers, bolds and italics (which are inline nodes and shouldn't count in the truncation process) etc. The final implementation will of course define which elements specifically are possible truncation candidates. Headers, even though they are block HTML elements, will not count as truncation points because we don't want widowed headers. Paragraphs, individual list items, whole ordered and unordered lists, block quotes, preformatted blocks, void elements etc. are good ones. Headers and all inline elements aren't.
Let's take this very Stack Overflow question as an example of HTML content that I would like to truncate. Let's set the truncation limit to 1000 with an offset of 250 characters (1/4).
This DotNetFiddle shows the text of this question with limit markers added inside it (|MIN| represents character 750, |LIMIT| represents character 1000 and |MAX| represents character 1250).
As can be seen from the example, the closest truncation boundary between two block nodes to character 1000 is between </OL> and P (My content-editable generated...). This means that my HTML should be truncated right between these two tags, which would result in content that is a little less than 1000 characters long text-wise but keeps the truncated content meaningful, because it wouldn't just cut somewhere in the middle of a text passage.
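The selection step of this example could be sketched roughly as below. This is only a minimal sketch under assumptions: `TruncationPoint` and `ChooseCutOffset` are hypothetical names, the block-end offsets are assumed to have been collected already (text-only positions), and falling back to N itself when no block ends in range is a placeholder choice, not a settled rule.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TruncationPoint
{
    // Hypothetical helper: returns the block-end offset closest to n
    // that lies within [n - x, n + x].
    // Assumption: when no block ends in that range, n itself is returned.
    public static int ChooseCutOffset(IEnumerable<int> blockEndOffsets, int n, int x)
    {
        List<int> inRange = blockEndOffsets
            .Where(offset => offset >= n - x && offset <= n + x)
            .OrderBy(offset => Math.Abs(offset - n)) // nearest to the limit first
            .ToList();

        return inRange.Count > 0 ? inRange[0] : n;
    }
}
```

With N = 1000 and X = 250, made-up block-end offsets of 230, 610, 940, 1190 and 1400 would give 940: both 940 and 1190 fall between 750 and 1250, and 940 is nearer to 1000.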
I hope this explains how things should be working related to this algorithm.
The first problem I'm seeing here is that I'm dealing with a nested structure like HTML. I also have to distinguish between elements (only block elements count, not inline ones). And last but not least, I will have to count only certain characters in my string and ignore those that belong to tags.
[…] N and convert back to HTML

How should one approach such a truncation algorithm? My head just seems to be too tired to come to a consensus (or solution).
Here is some sample code that can truncate the inner text. It uses the recursive capability of the InnerText property and the CloneNode method.
public static HtmlNode TruncateInnerText(HtmlNode node, int length)
{
    if (node == null)
        throw new ArgumentNullException("node");

    // nothing to do?
    if (node.InnerText.Length < length)
        return node;

    // shallow clone of the root; children are copied selectively below
    HtmlNode clone = node.CloneNode(false);
    TruncateInnerText(node, clone, clone, length);
    return clone;
}

private static void TruncateInnerText(HtmlNode source, HtmlNode root, HtmlNode current, int length)
{
    HtmlNode childClone;
    foreach (HtmlNode child in source.ChildNodes)
    {
        // limit already reached? stop, so later siblings are not copied as empty shells
        if (root.InnerText.Length >= length)
            return;

        // is the expected size ok?
        int expectedSize = child.InnerText.Length + root.InnerText.Length;
        if (expectedSize <= length)
        {
            // yes, just clone the whole hierarchy
            childClone = child.CloneNode(true);
            current.ChildNodes.Add(childClone);
            continue;
        }

        // is it a text node? then crop it
        HtmlTextNode text = child as HtmlTextNode;
        if (text != null)
        {
            int remove = expectedSize - length;
            childClone = root.OwnerDocument.CreateTextNode(text.InnerText.Substring(0, text.InnerText.Length - remove));
            current.ChildNodes.Add(childClone);
            return;
        }

        // it's not a text node, shallow clone and dive in
        childClone = child.CloneNode(false);
        current.ChildNodes.Add(childClone);
        TruncateInnerText(child, root, childClone, length);
    }
}
And here is a sample C# console app that scrapes this question as an example and truncates it to 500 characters.
using System;
using HtmlAgilityPack;

class Program
{
    static void Main(string[] args)
    {
        var web = new HtmlWeb();
        var doc = web.Load("http://stackoverflow.com/questions/30926684/truncating-html-content-at-the-end-of-text-blocks-block-elements");
        var post = doc.DocumentNode.SelectSingleNode("//td[@class='postcell']//div[@class='post-text']");
        var truncated = TruncateInnerText(post, 500);
        Console.WriteLine(truncated.OuterHtml);
        Console.WriteLine("Size: " + truncated.InnerText.Length);
    }

    // the two TruncateInnerText methods shown above go here
}
When run, it should display this:
<div class="post-text" itemprop="text">
<p>Mainly when we shorten/truncate textual content we usually just truncate it at specific character index. That's already complicated in HTML anyway, but I want to truncate my HTML content (generated using content-editable <code>div</code>) using different measures:</p>
<ol>
<li>I would define character index <code>N</code> that will serve as truncating startpoint <em>limit</em></li>
<li>Algorithm will check whether content is at least <code>N</code> characters long (text only; not counting tags); if it's not it will just return the whole content</li>
<li>It would then</li></ol></div>
Size: 500
Note: I have not truncated at word boundary, just at character boundary, and no, it's not at all following the suggestions in my comment :-)
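For what it's worth, a variant that cuts at the end of a block element, along the lines of the question, could look roughly like the sketch below. Treat it as a sketch under assumptions rather than a finished implementation: `BlockTruncator`, `TruncateAtBlockEnd`, `CollectBlockEnds` and the `CutAfter` element list are names and choices made up for illustration, whitespace-only text nodes are counted like any other text, and the method simply returns null when no block ends inside the range, leaving the fallback to the caller (for example the character-based TruncateInnerText above).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class BlockTruncator
{
    // Assumed candidate elements; headers and inline elements are deliberately left out.
    private static readonly HashSet<string> CutAfter = new HashSet<string>(
        StringComparer.OrdinalIgnoreCase) { "p", "li", "ol", "ul", "blockquote", "pre" };

    // Returns a truncated deep clone, the untouched source if its text has at most
    // n characters, or null when no candidate block ends within [n - x, n + x].
    public static HtmlNode TruncateAtBlockEnd(HtmlNode source, int n, int x)
    {
        if (source == null)
            throw new ArgumentNullException(nameof(source));

        HtmlNode clone = source.CloneNode(true);
        var ends = new List<(HtmlNode Node, int Offset)>();
        int seen = 0;
        CollectBlockEnds(clone, ref seen, ends);

        // short enough already: nothing to do
        if (seen <= n)
            return source;

        var inRange = ends
            .Where(e => Math.Abs(e.Offset - n) <= x)
            .OrderBy(e => Math.Abs(e.Offset - n))
            .ToList();
        if (inRange.Count == 0)
            return null;

        // drop everything that follows the chosen block, level by level up to the root
        for (HtmlNode node = inRange[0].Node; node != clone; node = node.ParentNode)
        {
            while (node.NextSibling != null)
                node.ParentNode.RemoveChild(node.NextSibling);
        }
        return clone;
    }

    // Depth-first walk that counts text-node characters only and records the
    // running count at the end of every candidate element.
    private static void CollectBlockEnds(HtmlNode parent, ref int seen,
        List<(HtmlNode Node, int Offset)> ends)
    {
        foreach (HtmlNode child in parent.ChildNodes)
        {
            if (child.NodeType == HtmlNodeType.Text)
            {
                seen += child.InnerText.Length;
            }
            else if (child.NodeType == HtmlNodeType.Element)
            {
                CollectBlockEnds(child, ref seen, ends);
                if (CutAfter.Contains(child.Name))
                    ends.Add((child, seen));
            }
        }
    }
}
```

Working on a deep clone keeps the source document intact, and removing the following siblings at every level up to the root means the result stays well-formed without any tag bookkeeping. For three four-character paragraphs inside a div, calling it with n = 7 and x = 2 should keep the first two paragraphs, since the second one ends at text offset 8.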