Most efficient way to add missing alt tags for images in a large html document

Tags: html, c#, accessibility

In order to comply with accessibility standards, I need to ensure that all images in some dynamically-generated html (which I don't control) have an empty alt attribute if none is specified.

Example input:

<html>
    <body>
          <img src="foo.gif" />
          <p>Some other content</p>
          <img src="bar.gif" alt="" />
          <img src="blah.gif" alt="Blah!" />
    </body>
</html>

Desired output:

<html>
    <body>
          <img src="foo.gif" alt="" />
          <p>Some other content</p>
          <img src="bar.gif" alt="" />
          <img src="blah.gif" alt="Blah!" />
    </body>
</html>

The html could be quite large and the DOM heavily-nested, so using something like the Html Agility Pack is out.

Can anyone suggest an efficient way to accomplish this?

Update:

It is a safe assumption that the html I'm dealing with is well-formed, so a potential solution need not account for that at all.

asked Sep 23 '11 15:09

DanP

1 Answer

Your problem seems very specific: you need to alter some output, but for performance reasons you don't want to parse the whole thing with something general-purpose like Html Agility Pack. The best solution would seem to be to do it the hard way.

I would just brute force it. It would be hard to do it more efficiently than something like this (not tested against real-world markup, so check it on your own input, but the logic should be fine):

// Requires: using System; using System.Text;
static string AddAltTag(string html) {
    StringBuilder sb = new StringBuilder(html.Length + 64);
    int pos = 0;      // where the next search starts
    int lastPos = 0;  // start of the text not yet copied to the output
    while ((pos = html.IndexOf("<img", pos, StringComparison.Ordinal)) >= 0) {
        // images can't have children, and there should not be any angle brackets
        // anywhere in the attributes, so the next '>' ends the tag
        int nextPos = html.IndexOf('>', pos);
        if (nextPos < 0) {
            // unclosed image -- just quit
            break;
        }
        // back up if XML formed (self-closing "/>")
        if (html[nextPos - 1] == '/') {
            nextPos--;
        }
        // can't just look for "alt": it could be in the image url or class name
        if (html.IndexOf(" alt=\"", pos, nextPos - pos, StringComparison.Ordinal) < 0) {
            // output everything from the last position up to, but not
            // including, the closing bracket (or slash)
            sb.Append(html, lastPos, nextPos - lastPos);
            // reuse an existing trailing space so we don't emit two in a row
            sb.Append(html[nextPos - 1] == ' ' ? "alt=\"\" " : " alt=\"\"");
            lastPos = nextPos;
        }
        pos = nextPos;
    }
    // copy whatever follows the last modified tag
    sb.Append(html, lastPos, html.Length - lastPos);
    return sb.ToString();
}

Depending on how consistent you can expect your HTML to be, you may need to make the comparisons case-insensitive, or test for variants such as alt = " (that is, with spaces around the equals sign).
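The variants mentioned above (different casing, whitespace around the equals sign) can be handled without lowercasing the whole document. Here is a minimal sketch of such a check, applied to the text of a single img tag. The class and method names (AltCheck, HasAltAttribute) are hypothetical and not part of the code above. It assumes no attribute value contains the text " alt =" itself, and it does not detect a bare alt attribute written without a value.

```csharp
using System;

static class AltCheck
{
    // Hypothetical helper: returns true when the text of a single <img ...> tag
    // contains an alt attribute, matched case-insensitively and allowing
    // whitespace around '='.
    public static bool HasAltAttribute(string tag)
    {
        int i = 0;
        while ((i = tag.IndexOf("alt", i, StringComparison.OrdinalIgnoreCase)) >= 0)
        {
            // "alt" must start an attribute name, so it has to follow whitespace
            // (this rejects src="alt.gif" and data-alt="x").
            bool startsName = i > 0 && char.IsWhiteSpace(tag[i - 1]);

            // Skip any whitespace between the name and the '=' sign.
            int j = i + 3;
            while (j < tag.Length && char.IsWhiteSpace(tag[j])) j++;

            if (startsName && j < tag.Length && tag[j] == '=') return true;

            // Not an attribute name here; keep scanning after this match.
            i += 3;
        }
        return false;
    }
}
```

A check like this could replace the plain IndexOf test for alt=" inside the loop, at the cost of a little extra work per image tag.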

By the way, if you want to use something a little more general for some reason, you can also give CsQuery a shot, although there is no way it would be faster than the code above. This is my own C# implementation of jQuery, which would do something like this very easily, e.g.

obj.Select("img").Not("[alt]").Attr("alt",String.Empty);

Since you say that Html Agility Pack performs badly on deeply-nested HTML, this may work better for you, because the HTML parser I use is not recursive and should perform linearly regardless of nesting. But it would be far slower than just coding to your exact need since it does, of course, parse the entire document into an object model. Whether that is fast enough for your situation, who knows.

answered Sep 21 '22 23:09

Jamie Treworgy
