
Split an HTML string into N parts

Tags: c#, regex, html-agility-pack, htmltidy

Does anybody have an example of splitting an HTML string (coming from a TinyMCE editor) into N parts using C#?

I need to split the string evenly without splitting words.

I was thinking of just splitting the HTML and using the HtmlAgilityPack to try and fix the broken tags. I'm not sure how to find the split point, though, as ideally it should be based purely on the text rather than on the HTML as well.

Anybody got any ideas on how to go about this?

UPDATE

As requested, here is an example of input and desired output.

INPUT:

<p><strong>Lorem ipsum dolor sit amet, <em>consectetur adipiscing</em></strong> elit.</p>

OUTPUT (When split into 3 cols):

Part1: <p><strong>Lorem ipsum dolor</strong></p>
Part2: <p><strong>sit amet, <em>consectetur</em></strong></p>
Part3: <p><strong><em>adipiscing</em></strong> elit.</p>

UPDATE 2:

I've just had a play with Tidy HTML and that seems to work well at fixing broken tags, so this may be a good option if I can find a way to locate the split points.

UPDATE 3

Using a method similar to the one in "Truncate string on whole words in .NET C#", I've now managed to get a list of plain text words that will make up each part. So, say I use Tidy HTML to get a valid XML structure for the HTML: given this list of words, has anybody got an idea of what would now be the best way to split it?

UPDATE 4

Can anybody see an issue with using a regex to find the indices within the HTML in the following way:

Given the plain text string "sit amet, consectetur", replace all spaces with the regex "(\s|<(.|\n)+?>)*", in theory finding that string with any combination of spaces and/or tags

I could then just use Tidy HTML to fix the broken html tags?
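
A minimal sketch of the regex idea from UPDATE 4, to make it concrete. The class and method names (SplitPointFinder, BuildPattern) are made up for illustration. Two assumptions differ from the pattern quoted above: the tag part is written as <[^>]+> instead of <(.|\n)+?>, and each word is passed through Regex.Escape so punctuation is matched literally. It does not handle HTML entities, and because the separator is optional it would also match the words run together.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SplitPointFinder {
    // Hypothetical helper: builds a pattern that matches the plain-text words
    // when they are separated in the HTML by any mix of whitespace and tags.
    public static Regex BuildPattern(string plainText) {
        string[] words = plainText.Split(
            new[] { ' ' },
            StringSplitOptions.RemoveEmptyEntries
        );

        // Escape each word so characters such as '.' are matched literally.
        string[] escaped = words.Select(w => Regex.Escape(w)).ToArray();

        // Zero or more whitespace characters or tags between consecutive words.
        return new Regex(string.Join(@"(?:\s|<[^>]+>)*", escaped));
    }
}
```

Calling BuildPattern("sit amet, consectetur").Match(html) on the sample input gives a Match whose Index and Length locate that part inside the HTML. The slice between two such indices will still contain unbalanced tags, which is where Tidy HTML would come in.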

Many thanks

Matt


asked May 01 '10 13:05

Matt Brailsford

1 Answer

A Proposed Solution

Man, this is a curse of mine! I apparently cannot walk away from a problem without spending up-to-and-including an unreasonable amount of time on it.

I thought about this. I thought about HTML Tidy, and maybe it would work, but I had trouble wrapping my head around it.

So, I wrote my own solution.

I tested this on your input and on some other input that I threw together myself. It seems to work pretty well. Surely there are holes in it, but it might provide you with a starting point.

Anyway, my approach was this:

1. Encapsulate the notion of a single word in an HTML document using a class that includes information about that word's position in the HTML document hierarchy, up to a given "top". This I have implemented in the HtmlWord class below.
2. Create a class that is capable of writing a single line composed of these HTML words above, such that start-element and end-element tags are added in the appropriate places. This I have implemented in the HtmlLine class below.
3. Write a few extension methods to make these classes immediately and intuitively accessible straight from an HtmlAgilityPack.HtmlNode object. These I have implemented in the HtmlHelper class below.

Am I crazy for doing all this? Probably, yes. But, you know, if you can't figure out any other way, you can give this a try.

Here's how it works with your sample input:

var document = new HtmlDocument();
document.LoadHtml("<p><strong>Lorem ipsum dolor sit amet, <em>consectetur adipiscing</em></strong> elit.</p>");

var nodeToSplit = document.DocumentNode.SelectSingleNode("p");
var lines = nodeToSplit.SplitIntoLines(3);

foreach (var line in lines)
    Console.WriteLine(line.ToString());

Output:

<p><strong>Lorem ipsum dolor </strong></p>
<p><strong>sit amet, <em>consectetur </em></strong></p>
<p><strong><em>adipiscing </em></strong>elit. </p>
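
Note that the argument to SplitIntoLines is the number of words per line, not the number of parts; the sample has 8 words, so 3 words per line happens to give 3 lines. If a fixed number of parts is what you need, one possible approach is to derive the words-per-line value with a ceiling division. This is only a sketch; PartSizing and WordsPerPart are hypothetical names, not part of the classes in this answer.

```csharp
using System;

public static class PartSizing {
    // Hypothetical helper: the smallest words-per-line value that fits
    // wordCount words into at most `parts` lines.
    public static int WordsPerPart(int wordCount, int parts) {
        if (parts < 1)
            throw new ArgumentOutOfRangeException("parts");

        // Integer ceiling of wordCount / parts, never less than one word.
        return Math.Max(1, (wordCount + parts - 1) / parts);
    }
}
```

The word count could come from nodeToSplit.GetWords(nodeToSplit.ParentNode).Count, and the result would be passed to SplitIntoLines. Be aware that this can produce fewer lines than requested (4 words in 3 parts gives 2 lines of 2 words).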

And now for the code:

HtmlWord class

using System;
using System.Collections.Generic;
using System.Linq;

using HtmlAgilityPack;

public class HtmlWord {
    public string Text { get; private set; }
    public HtmlNode[] NodeStack { get; private set; }

    // convenience property to display list of ancestors cleanly
    // (for ease of debugging)
    public string NodeList {
        get { return string.Join(", ", NodeStack.Select(n => n.Name).ToArray()); }
    }

    internal HtmlWord(string text, HtmlNode node, HtmlNode top) {
        Text = text;
        NodeStack = GetNodeStack(node, top);
    }

    private static HtmlNode[] GetNodeStack(HtmlNode node, HtmlNode top) {
        var nodes = new Stack<HtmlNode>();

        while (node != null && !node.Equals(top)) {
            nodes.Push(node);
            node = node.ParentNode;
        }

        return nodes.ToArray();
    }
}
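
A small aside on why GetNodeStack ends up with the outermost ancestor first: Stack<T>.ToArray returns the items in pop order, so the node pushed last (the one closest to "top") comes out at index 0. The sketch below shows the same thing with plain tag names instead of HtmlNode objects; AncestorOrder is a made-up name used only for this illustration.

```csharp
using System.Collections.Generic;

public static class AncestorOrder {
    // Takes names in the order they are met while walking up the tree
    // (innermost first) and returns them outermost first.
    public static string[] OutermostFirst(IEnumerable<string> innermostFirst) {
        var stack = new Stack<string>();
        foreach (var name in innermostFirst)
            stack.Push(name);

        // ToArray yields the most recently pushed item first.
        return stack.ToArray();
    }
}
```

For the word "consectetur" in the sample, walking up gives em, strong, p, and the resulting array is p, strong, em, which is the order in which HtmlLine needs to write the start tags.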

HtmlLine class

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml;

using HtmlAgilityPack;

[Flags()]
public enum NodeChange {
    None = 0,
    Dropped = 1,
    Added = 2
}

public class HtmlLine {
    private List<HtmlWord> _words;
    public IList<HtmlWord> Words {
        get { return _words.AsReadOnly(); }
    }

    public int WordCount {
        get { return _words.Count; }
    }

    public HtmlLine(IEnumerable<HtmlWord> words) {
        _words = new List<HtmlWord>(words);
    }

    private static NodeChange CompareNodeStacks(HtmlWord x, HtmlWord y, out HtmlNode[] droppedNodes, out HtmlNode[] addedNodes) {
        var droppedList = new List<HtmlNode>();
        var addedList = new List<HtmlNode>();

        // traverse x's NodeStack backwards (innermost first) to find nodes
        // that are not in y's NodeStack (and are therefore "finished")
        foreach (var node in x.NodeStack.Reverse()) {
            if (!Array.Exists(y.NodeStack, n => n.Equals(node)))
                droppedList.Add(node);
        }

        // traverse y's NodeStack forwards (outermost first) to find nodes
        // that are not in x's NodeStack (and are therefore "new")
        foreach (var node in y.NodeStack) {
            if (!Array.Exists(x.NodeStack, n => n.Equals(node)))
                addedList.Add(node);
        }

        droppedNodes = droppedList.ToArray();
        addedNodes = addedList.ToArray();

        // set a flag for each kind of change that was found
        NodeChange change = NodeChange.None;
        if (droppedNodes.Length > 0)
            change |= NodeChange.Dropped;
        if (addedNodes.Length > 0)
            change |= NodeChange.Added;

        // could maybe use this in some later revision?
        // not worth the effort right now...
        return change;
    }

    public override string ToString() {
        if (WordCount < 1)
            return string.Empty;

        var lineBuilder = new StringBuilder();

        using (var lineWriter = new StringWriter(lineBuilder))
        using (var xmlWriter = new XmlTextWriter(lineWriter)) {
            var firstWord = _words[0];
            foreach (var node in firstWord.NodeStack) {
                xmlWriter.WriteStartElement(node.Name);
                foreach (var attr in node.Attributes)
                    xmlWriter.WriteAttributeString(attr.Name, attr.Value);
            }
            xmlWriter.WriteString(firstWord.Text + " ");

            for (int i = 1; i < WordCount; ++i) {
                var previousWord = _words[i - 1];
                var word = _words[i];

                HtmlNode[] droppedNodes;
                HtmlNode[] addedNodes;

                CompareNodeStacks(
                    previousWord,
                    word,
                    out droppedNodes,
                    out addedNodes
                );

                foreach (var dropped in droppedNodes)
                    xmlWriter.WriteEndElement();
                foreach (var added in addedNodes) {
                    xmlWriter.WriteStartElement(added.Name);
                    foreach (var attr in added.Attributes)
                        xmlWriter.WriteAttributeString(attr.Name, attr.Value);
                }

                xmlWriter.WriteString(word.Text + " ");

                if (i == _words.Count - 1) {
                    foreach (var node in word.NodeStack)
                        xmlWriter.WriteEndElement();
                }
            }
        }

        return lineBuilder.ToString();
    }
}
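
To make the comparison in CompareNodeStacks easier to follow, here is a simplified sketch of the same idea using tag names in place of HtmlNode objects. That is an assumption made for brevity: the real code compares node identity, so two different elements with the same name are still told apart, which this version would not do. StackDiff is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StackDiff {
    // Ancestors of the previous word that the next word no longer has,
    // innermost first (the order in which end tags must be written).
    public static List<string> Dropped(string[] previous, string[] next) {
        return Enumerable.Reverse(previous)
            .Where(tag => Array.IndexOf(next, tag) < 0)
            .ToList();
    }

    // Ancestors of the next word that the previous word did not have,
    // outermost first (the order in which start tags must be written).
    public static List<string> Added(string[] previous, string[] next) {
        return next
            .Where(tag => Array.IndexOf(previous, tag) < 0)
            .ToList();
    }
}
```

Going from "amet," (p, strong) to "consectetur" (p, strong, em) adds em and drops nothing; going from "adipiscing" (p, strong, em) to "elit." (p) drops em and then strong, and adds nothing.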

HtmlHelper static class

using System;
using System.Collections.Generic;
using System.Linq;

using HtmlAgilityPack;

public static class HtmlHelper {
    public static IList<HtmlLine> SplitIntoLines(this HtmlNode node, int wordsPerLine) {
        var lines = new List<HtmlLine>();

        var words = node.GetWords(node.ParentNode);

        for (int i = 0; i < words.Count; i += wordsPerLine) {
            lines.Add(new HtmlLine(words.Skip(i).Take(wordsPerLine)));
        }

        return lines.AsReadOnly();
    }

    public static IList<HtmlWord> GetWords(this HtmlNode node, HtmlNode top) {
        var words = new List<HtmlWord>();

        if (node.HasChildNodes) {
            foreach (var child in node.ChildNodes)
                words.AddRange(child.GetWords(top));
        } else {
            var textNode = node as HtmlTextNode;
            if (textNode != null && !string.IsNullOrEmpty(textNode.Text)) {
                string[] singleWords = textNode.Text.Split(
                    new string[] {" "},
                    StringSplitOptions.RemoveEmptyEntries
                );
                words.AddRange(
                    singleWords.Select(w => new HtmlWord(w, node.ParentNode, top))
                );
            }
        }

        return words.AsReadOnly();
    }
}
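
One of the likely holes: GetWords splits only on the space character, so text containing tabs or line breaks (which editor output may well include) would keep those inside a "word". A possible variation, shown here as a standalone sketch with the hypothetical name WordSplitter, is to pass a null separator array to String.Split, which makes it split on any whitespace.

```csharp
using System;

public static class WordSplitter {
    // Splits on any run of whitespace (spaces, tabs, line breaks).
    // A null separator array tells String.Split to use whitespace.
    public static string[] SplitWords(string text) {
        return text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

This still treats entities such as &nbsp; as ordinary text, so it is only a partial improvement.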

Conclusion

Just to reiterate: this is a thrown-together solution; I'm sure it has problems. I present it only as a starting point for you to consider -- again, if you're unable to get the behavior you want through other means.


answered Oct 01 '22 09:10

Dan Tao
