In order to comply with accessibility standards, I need to ensure that all images in some dynamically generated HTML (which I don't control) have an empty alt attribute if none is specified.
Example input:
<html>
<body>
<img src="foo.gif" />
<p>Some other content</p>
<img src="bar.gif" alt="" />
<img src="blah.gif" alt="Blah!" />
</body>
</html>
Desired output:
<html>
<body>
<img src="foo.gif" alt="" />
<p>Some other content</p>
<img src="bar.gif" alt="" />
<img src="blah.gif" alt="Blah!" />
</body>
</html>
The html could be quite large and the DOM heavily-nested, so using something like the Html Agility Pack is out.
Can anyone suggest an efficient way to accomplish this?
Update:
It is a safe assumption that the html I'm dealing with is well-formed, so a potential solution need not account for that at all.
When choosing alt text, focus on creating useful, information-rich content that uses keywords appropriately and is in context of the content of the page. Avoid filling alt attributes with keywords (keyword stuffing) as it results in a negative user experience and may cause your site to be seen as spam.
The required alt attribute specifies an alternate text for an image, if the image cannot be displayed. The alt attribute provides alternative information for an image if a user for some reason cannot view it (because of slow connection, an error in the src attribute, or if the user uses a screen reader).
Definition: An alt tag, also known as "alt attribute" and "alt description," is an HTML attribute applied to image tags to provide a text alternative for search engines. Applying alt tags to images such as product photos can positively impact an ecommerce store's search engine rankings.
Your problem seems very specific: you need to alter some output, but for performance reasons you don't want to parse the whole thing with something general-purpose like HtmlAgilityPack. The best solution would seem to be to do it the hard way.
I would just brute force it. It would be hard to do it more efficiently than something like this (completely untested and almost guaranteed not to work exactly as-is, but logic should be fine, if missing a "+1" or "-1" somewhere):
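using System;
using System.Text;

string AddAltTag(string html) {
    StringBuilder sb = new StringBuilder(html.Length + 64);
    int pos = 0;
    int lastPos = 0;
    while (pos >= 0) {
        pos = html.IndexOf("<img", pos, StringComparison.OrdinalIgnoreCase);
        if (pos < 0) {
            break;
        }
        // images can't have children, and there should not be any angle brackets
        // anywhere in the attributes, so this should work fine
        int nextPos = html.IndexOf('>', pos);
        if (nextPos < 0) {
            // unclosed image -- just quit
            break;
        }
        // back up if XML formed (self-closing "/>")
        if (html[nextPos - 1] == '/') {
            nextPos--;
        }
        // also back up over whitespace so the new attribute lands before it
        while (nextPos > pos && char.IsWhiteSpace(html[nextPos - 1])) {
            nextPos--;
        }
        // output everything from the last position up to, but not including,
        // the end of the tag
        sb.Append(html, lastPos, nextPos - lastPos);
        // can't just look for "alt": it could be in the image url or class name
        if (html.IndexOf(" alt=\"", pos, nextPos - pos, StringComparison.OrdinalIgnoreCase) < 0) {
            sb.Append(" alt=\"\"");
        }
        // continue searching after this tag; without this the loop never advances
        lastPos = nextPos;
        pos = nextPos;
    }
    // output whatever follows the last image
    sb.Append(html, lastPos, html.Length - lastPos);
    return sb.ToString();
}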
You may need to do things like convert to lowercase before testing, or parse or test for variants such as alt = " (that is, with spaces around the equals sign), depending on the consistency you can expect from your HTML.
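As a rough sketch of what a more tolerant test could look like (my illustration, not something taken from the code above): the hypothetical helper HasAltAttribute below takes the text of a single img tag and walks its attributes by hand. It ignores case, allows spaces around the equals sign, accepts either quote style, and skips over quoted values so that "alt=" inside a URL or a class name is not mistaken for the attribute. It assumes well-formed markup, as stated in the question.

```csharp
using System;

static class AltCheck
{
    // Hypothetical helper: true if the tag text (e.g. "<img src='a.gif' alt = ''")
    // contains an attribute named "alt". Assumes well-formed markup.
    public static bool HasAltAttribute(string tag)
    {
        int i = 0;
        // skip the element name ("<img")
        while (i < tag.Length && !char.IsWhiteSpace(tag[i])) i++;

        while (i < tag.Length)
        {
            // skip whitespace and any trailing "/" or ">"
            while (i < tag.Length && (char.IsWhiteSpace(tag[i]) || tag[i] == '/' || tag[i] == '>')) i++;
            if (i >= tag.Length) break;

            // read the attribute name
            int nameStart = i;
            while (i < tag.Length && !char.IsWhiteSpace(tag[i])
                   && tag[i] != '=' && tag[i] != '/' && tag[i] != '>') i++;
            if (i - nameStart == 3 &&
                string.Compare(tag, nameStart, "alt", 0, 3, StringComparison.OrdinalIgnoreCase) == 0)
            {
                return true;
            }

            // skip optional whitespace, "=", and the attribute value
            while (i < tag.Length && char.IsWhiteSpace(tag[i])) i++;
            if (i < tag.Length && tag[i] == '=')
            {
                i++;
                while (i < tag.Length && char.IsWhiteSpace(tag[i])) i++;
                if (i < tag.Length && (tag[i] == '"' || tag[i] == '\''))
                {
                    // quoted value: jump to the matching closing quote
                    int close = tag.IndexOf(tag[i], i + 1);
                    i = close < 0 ? tag.Length : close + 1;
                }
                else
                {
                    // unquoted value: runs until whitespace or ">"
                    while (i < tag.Length && !char.IsWhiteSpace(tag[i]) && tag[i] != '>') i++;
                }
            }
        }
        return false;
    }
}
```

Something like AltCheck.HasAltAttribute(html.Substring(pos, nextPos - pos)) could stand in for the IndexOf(" alt=\"") test, at the cost of one extra string allocation per image.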
By the way, there is no way this would be faster, but if you want to use something a little more general for some reason, you can also give CsQuery a shot. This is my own C# implementation of jQuery, which would do something like this very easily, e.g.
obj.Select("img").Not("[alt]").Attr("alt",String.Empty);
Since you say that HTML agility pack performs badly on deeply-nested HTML, this may work better for you, because the HTML parser I use is not recursive and should perform linearly regardless of nesting. But it would be far slower than just coding to your exact need since it does, of course, parse the entire document into an object model. Whether that is fast enough for your situation, who knows.