How to parse HTML to modify all words

Tags:

html

c#

This seems to be a recurring question, but here goes.

I have HTML which is well-formed (it comes from a controlled source, so this can be taken as given). I need to iterate through the contents of the body of the HTML, find all the words in the document, perform some editing on those words, and save the results.

For example, I have a file sample.html, and I want to run it through my application and produce output.html, which is exactly the same as the original, plus my edits.

I found the following using HtmlAgilityPack, but all the examples I've found look at the attributes of the specified tags. Is there an easy modification that will look at the contents and perform my edits?

HtmlDocument HD = new HtmlDocument();
HD.Load (@"e:\test.htm");
var NoAltElements = HD.DocumentNode.SelectNodes("//img[not(@alt)]");
if (NoAltElements != null)
{
    foreach (HtmlNode HN in NoAltElements)
    {
       HN.Attributes.Append("alt", "no alt image");
    }
}

HD.Save(@"e:\test.htm");

The above looks for image tags with no alt attribute. I want to look at all tags in the <body> of the file and do something with their contents (which may involve creating new tags in the process).

A very simple sample of what I might do is take the following input:

<html>
    <head><title>Some Title</title></head>
    <body>
        <h1>This is my page</h1>
        <p>This is a paragraph of text.</p>
    </body>
</html>

and produce the following output, in which the words alternate between uppercase and italics:

<html>
    <head><title>Some Title</title></head>
    <body>
        <h1>THIS <em>is</em> MY <em>page</em></h1>
        <p>THIS <em>is</em> A <em>paragraph</em> OF <em>text</em>.</p>
    </body>
</html>

Ideas, suggestions?

asked Feb 11 '11 16:02

Elie

1 Answer

Personally, given this setup, I'd use the InnerText property of HtmlNode to find the words (probably with a Regex, so that punctuation is excluded rather than relying on spaces alone), and then make the changes through the InnerHtml property with iterative calls to Regex.Replace (it has an overload that lets you specify both the maximum number of replacements and the start position).

Processing code:

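// Requires: using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
// "something" is a placeholder: substitute whatever test selects the nodes you want to edit.
// ToList() takes a snapshot so that assigning InnerHtml does not disturb the enumeration.
List<HtmlNode> nodes = doc.DocumentNode.DescendantNodes()
    .Where(n => n.InnerText == "something")
    .ToList();
foreach (HtmlNode node in nodes)
{
    // Find the words in the plain text, then rewrite them in the markup.
    string[] words = getWords(node.InnerText);

    node.InnerHtml = processHtml(node.InnerHtml, words);
}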

Identify the words (there's probably a slicker way to do this, but here's an initial stab):

private string[] getWords(string text)
{
    // \w+ matches runs of letters, digits and underscores, so punctuation is skipped.
    Regex reg = new Regex(@"\w+");
    MatchCollection matches = reg.Matches(text);
    List<string> words = new List<string>();
    foreach (Match match in matches)
    {
        words.Add(match.Value);
    }
    return words.ToArray();
}
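To see what the `\w+` pattern actually extracts, here is a small standalone sketch. The class name `WordDemo` and the LINQ-based `GetWords` are illustrative only and are not part of the answer's code. Note that an apostrophe counts as punctuation, so a contraction is split in two.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordDemo
{
    // Same idea as getWords above, written with LINQ (hypothetical helper).
    public static string[] GetWords(string text) =>
        Regex.Matches(text, @"\w+").Cast<Match>().Select(m => m.Value).ToArray();

    public static void Main()
    {
        // The trailing full stop is not a word character, so it is dropped.
        Console.WriteLine(string.Join("|", GetWords("This is a paragraph of text.")));
        // prints: This|is|a|paragraph|of|text

        // The apostrophe splits "don't" into two matches.
        Console.WriteLine(string.Join("|", GetWords("don't stop")));
        // prints: don|t|stop
    }
}
```

If contractions matter for your edits, a pattern such as `[\w']+` may be a better starting point.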

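Process the HTML:

private string processHtml(string html, string[] words)
{
    int startPosition = 0;
    foreach (string word in words)
    {
        // Find the next occurrence of the word at or after the previous replacement.
        int index = html.IndexOf(word, startPosition, StringComparison.Ordinal);
        if (index < 0)
        {
            continue; // word not found from this position; skip it
        }

        // Escape the word so it is matched literally, and replace only this occurrence.
        Regex reg = new Regex(Regex.Escape(word));
        string replacement = alterWord(word);
        html = reg.Replace(html, m => replacement, 1, index);

        // Resume after the inserted text so it is never matched again.
        startPosition = index + replacement.Length;
    }

    return html;
}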

I'll leave the details of alterWord() to you. :)
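For the alternating uppercase/italics example in the question, one possible shape for that helper is sketched below. The class `WordAlterer`, its `AlterWord` method, and `Reset` are hypothetical names, not part of the answer. The sketch assumes the alternation restarts for every element, as it does in the question's sample output.

```csharp
using System;
using System.Linq;

public class WordAlterer
{
    private int _count; // words seen since the last Reset()

    // Words 0, 2, 4, ... are uppercased; words 1, 3, 5, ... are wrapped in <em>.
    public string AlterWord(string word)
    {
        bool makeUpper = _count % 2 == 0;
        _count++;
        return makeUpper ? word.ToUpperInvariant() : "<em>" + word + "</em>";
    }

    // Call between nodes so each element starts with an uppercase word again.
    public void Reset()
    {
        _count = 0;
    }
}

public static class Program
{
    public static void Main()
    {
        var alterer = new WordAlterer();
        string[] words = { "This", "is", "my", "page" };
        Console.WriteLine(string.Join(" ", words.Select(w => alterer.AlterWord(w))));
        // prints: THIS <em>is</em> MY <em>page</em>
    }
}
```

Because the replacement is located with a plain text search, a word that also appears inside a tag name or attribute in the node's markup can be matched there instead. Restricting the edits to text nodes avoids that.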

answered Oct 13 '22 00:10

Jacob Proffitt
