
How to parse HTML to modify all words

Tags: html, c#

This seems to be a recurring question, but here goes.

I have HTML which is well-formatted (it comes from a controlled source, so this can be taken to be a given). I need to iterate through the contents of the body of the HTML, look for all the words in the document, perform some editing on those words, and save the results.

For example, I have a file sample.html, and I want to run it through my application and produce output.html, which is exactly the same as the original, plus my edits.

I found the following example using HtmlAgilityPack, but every example I've found inspects the attributes of the selected tags. Is there an easy modification that looks at the contents instead and performs my edits?

HtmlDocument HD = new HtmlDocument();
HD.Load (@"e:\test.htm");
var NoAltElements = HD.DocumentNode.SelectNodes("//img[not(@alt)]");
if (NoAltElements != null)
{
    foreach (HtmlNode HN in NoAltElements)
    {
       HN.Attributes.Append("alt", "no alt image");
    }
}

HD.Save(@"e:\test.htm");

The above looks for image tags with no alt attribute. I want to look at all the tags in the <body> of the file and do something with their contents (which may involve creating new tags in the process).

A very simple sample of what I might do is take the following input:

<html>
    <head><title>Some Title</title></head>
    <body>
        <h1>This is my page</h1>
        <p>This is a paragraph of text.</p>
    </body>
</html>

and produce the following output, in which the words alternate between uppercase and italics:

<html>
    <head><title>Some Title</title></head>
    <body>
        <h1>THIS <em>is</em> MY <em>page</em></h1>
        <p>THIS <em>is</em> A <em>paragraph</em> OF <em>text</em>.</p>
    </body>
</html>

Ideas, suggestions?

Elie asked Feb 11 '11 16:02



1 Answer

Personally, given this setup, I'd use the InnerText property of HtmlNode to find the words (probably with a Regex, so that punctuation is excluded rather than relying on spaces alone), and then apply the changes through the InnerHtml property with repeated calls to Regex.Replace. The instance method has an overload that takes both a maximum number of replacements and a start position.
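As a minimal sketch of that Regex.Replace overload in isolation: the class and method names below are hypothetical and exist only for illustration. The call itself is the instance method `Replace(input, replacement, count, startat)` from System.Text.RegularExpressions.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ReplaceFromPositionDemo
{
    // Hypothetical helper: replaces at most one match of `pattern`,
    // considering only matches that begin at or after index `startAt`.
    public static string ReplaceOnceFrom(string input, string pattern, string replacement, int startAt)
    {
        Regex reg = new Regex(pattern);
        // count = 1 limits the call to a single replacement;
        // text before startAt is copied to the result unchanged.
        return reg.Replace(input, replacement, 1, startAt);
    }
}

// Usage:
//   ReplaceFromPositionDemo.ReplaceOnceFrom("This is his", "is", "IS", 4)
//   returns "This IS his": the "is" inside "This" lies before index 4 and is skipped,
//   and the "is" inside "his" is left alone because count is 1.
```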

Processing code:

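// Requires: using System.Linq; doc is an already-loaded HtmlDocument.
// The predicate is a placeholder; substitute whatever selects the nodes to edit.
// ToList() snapshots the matches, because assigning InnerHtml changes the tree being enumerated.
List<HtmlNode> nodes = doc.DocumentNode.DescendantNodes().Where(n => n.InnerText == "something").ToList();
foreach (HtmlNode node in nodes)
{
    string[] words = getWords(node.InnerText);

    node.InnerHtml = processHtml(node.InnerHtml, words);
}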

Identify the words (there's probably a slicker way to do this, but here's an initial stab):

private string[] getWords(string text)
{
    // \w+ matches runs of word characters; the verbatim string keeps the backslash intact.
    Regex reg = new Regex(@"\w+");
    MatchCollection matches = reg.Matches(text);
    List<string> words = new List<string>();
    foreach (Match match in matches)
    {
        words.Add(match.Value);
    }
    return words.ToArray();
}

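Process the HTML (note that the search runs over the raw markup, so a word can also match inside a tag name or attribute):

private string processHtml(string html, string[] words)
{
    int startPosition = 0;
    foreach (string word in words)
    {
        // Locate the next occurrence at or after the end of the previous replacement.
        int index = html.IndexOf(word, startPosition, StringComparison.Ordinal);
        if (index < 0)
        {
            continue; // word not found in the remaining markup; skip it
        }

        string replacement = alterWord(word);
        Regex reg = new Regex(Regex.Escape(word));
        // The MatchEvaluator overload inserts the replacement literally ("$" is not a substitution).
        html = reg.Replace(html, m => replacement, 1, index);
        // Skip past the inserted text so the next search cannot match inside it.
        startPosition = index + replacement.Length;
    }

    return html;
}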

I'll leave the details of alterWord() to you. :)
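For the alternating uppercase/italics example in the question, one possible alterWord() could look like the sketch below. This is only an illustration, not part of the original answer: the WordAlternator class, its counter, and the Reset() method are all hypothetical, and whether the counter should restart for each node is an assumption you would need to decide for your own documents.

```csharp
using System;

public static class WordAlternator
{
    // Hypothetical state: how many words have been altered so far.
    private static int wordIndex = 0;

    // Restart the alternation, e.g. before processing a new node (assumed behaviour).
    public static void Reset()
    {
        wordIndex = 0;
    }

    // Even-numbered words (0, 2, 4, ...) are uppercased;
    // odd-numbered words are wrapped in <em> tags.
    public static string AlterWord(string word)
    {
        bool makeUpper = wordIndex % 2 == 0;
        wordIndex++;
        return makeUpper ? word.ToUpperInvariant() : "<em>" + word + "</em>";
    }
}

// Usage: calling AlterWord on "This", "is", "my", "page" in that order
// yields "THIS", "<em>is</em>", "MY", "<em>page</em>".
```

Because this version inserts markup, the search position in processHtml needs to move past each replacement; otherwise a later word such as "em" could match inside a tag that was just added.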

Jacob Proffitt answered Oct 13 '22 00:10