
How do I fix invalid HTML characters in pages served with different encoding?

I have a number of websites that are rendering invalid characters. The pages' meta tags specify UTF-8 encoding. However, a number of pages contain bytes that can't be interpreted as UTF-8, probably because the files were saved with another encoding (such as ANSI). The one I'm most concerned about right now is a curly apostrophe (as in "Bob’s"... sorry if that doesn't show up correctly). The W3C validator reports the offending byte as "\x92", but it won't validate the file because that byte doesn't map to Unicode. And, of course, if I open the file in Notepad++ and change the encoding to UTF-8, the character is replaced by a 92 in a black box.

Here's my question: what's the easiest way to fix this? Do I have to open all the pages and replace that character with a conventional apostrophe? Or is there a quick fix I could add (say, to IIS) that might override or fix the encoding issue? Or do I have to brute-force find/replace? I have hundreds of pages on these websites and I have no idea how many of them I'd have to change, so if anyone knows a way I could either circumvent this problem or fix it quickly I would appreciate it.

Andy asked Sep 30 '10 17:09


People also ask

How do I specify character encoding in HTML?

Character encoding can be specified with a meta tag in HTML. The meta tag carries metadata about the page and is not displayed in the page itself. It should be placed inside the head element, as early as possible, so the browser sees the encoding before it parses the rest of the document.

How do I set HTML to UTF-8?

The character encoding should be specified for every HTML page, either by using the charset parameter on the Content-Type HTTP response header (e.g.: Content-Type: text/html; charset=utf-8 ) and/or using the charset meta tag in the file.

What is an invalid UTF-8 character?

An invalid UTF-8 character is a byte, or sequence of bytes, that does not follow the UTF-8 encoding rules. This usually happens when text was saved in a different encoding (such as Windows-1252) but is read as UTF-8. A decoder that meets such bytes either reports an error or substitutes a replacement character.
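
As a small illustration of the point above, the following Python sketch checks whether a byte string is valid UTF-8. The helper name is_valid_utf8 and the sample bytes are made up for this example; the sample assumes the stray byte came from Windows-1252 (Python's "cp1252" codec), where 0x92 is the curly apostrophe.

```python
# Byte 0x92 is the right single quotation mark in Windows-1252,
# but a lone 0x92 is not a valid UTF-8 sequence.
raw = b"Bob\x92s"


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8(raw))                           # False
print(raw.decode("cp1252") == "Bob\u2019s")         # True
print(is_valid_utf8("Bob\u2019s".encode("utf-8")))  # True
```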

What is HTML encoding explain with an example?

HTML encoding means converting a document that contains characters outside the range of seven-bit ASCII into a standard byte representation. The server tells the browser which encoding was used through header information, so that the browser can parse the document correctly.


2 Answers

Are you serving the pages as straight HTML, or do you have a script serving the content? If a script is serving the content, that script could just look for any instance of \x92 and replace it with an apostrophe. In PHP this would be a simple str_replace().
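
The answer mentions PHP's str_replace(); the same idea can be sketched in Python, shown here only as an illustration. The function name fix_apostrophes is hypothetical. One assumption added beyond the answer: content that is already valid UTF-8 is left untouched, because 0x92 can legitimately appear inside a multi-byte UTF-8 sequence and replacing it there would corrupt the text.

```python
def fix_apostrophes(body: bytes) -> bytes:
    """Replace stray 0x92 bytes with an ASCII apostrophe.

    Content that already decodes as UTF-8 is returned unchanged,
    since 0x92 may be a continuation byte of a valid character.
    """
    try:
        body.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: treat 0x92 as the stray curly apostrophe.
        return body.replace(b"\x92", b"'")
    return body


print(fix_apostrophes(b"<p>Bob\x92s page</p>"))  # b"<p>Bob's page</p>"
```

A serving script would call something like this on the response body just before sending it. A page that mixes valid UTF-8 with stray bytes would need more careful handling than this sketch provides.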

If you're serving straight HTML then you'll have to modify the files themselves. This can be automated, however (and probably should be if you have hundreds of files), depending on which tools you have available and which operating system you're on. Since you said you're using Notepad++, I suppose it's safe to assume you're on MS Windows (therefore no fun Unix commands to speed things up).

It may be possible to create a batch script that does this, however. There are very simple ASCII text editing tools built into Command Prompt. If that's not possible, then it's very possible to write a C or C++ program to do this, if you have a compiler on your system and moderate knowledge of C. If you have the former and not the latter, ask and I'll whip up some source for you.
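
The answer offers a C program; as an alternative illustration, here is a minimal Python sketch of the same file-level fix. It goes one step further than replacing a single byte and re-encodes the whole file as UTF-8, on the assumption (not confirmed in the question) that the files were saved as Windows-1252. The function name transcode_to_utf8 and its parameters are hypothetical.

```python
from pathlib import Path


def transcode_to_utf8(path: Path, source_encoding: str = "cp1252") -> bool:
    """Rewrite path as UTF-8 unless it is already valid UTF-8.

    Returns True if the file was rewritten, False if it was left alone.
    """
    raw = path.read_bytes()
    try:
        raw.decode("utf-8")
        return False  # already valid UTF-8, nothing to do
    except UnicodeDecodeError:
        pass
    # Bytes undefined in the source encoding become U+FFFD rather than
    # aborting the conversion.
    text = raw.decode(source_encoding, errors="replace")
    # Write bytes directly so line endings are preserved as they were.
    path.write_bytes(text.encode("utf-8"))
    return True
```

With this approach the curly apostrophe survives as a real Unicode character instead of being flattened to a plain one. Back up the files first, since the rewrite happens in place.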

stevendesu answered Sep 17 '22 00:09


I'm not sure about the encoding part of it myself, but if you wind up having to do it by brute force, you could always write a short program that iterates through all of your web pages, loads each file into memory, runs a regex replace to fix the problem character, and saves the file back to disk. Obviously not ideal, but better than opening each file yourself.
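
The brute-force program described above could look roughly like the following Python sketch. The function name fix_tree, the "*.html" file pattern, and the choice to skip files that are already valid UTF-8 are assumptions made for this example, not details from the question.

```python
import re
from pathlib import Path

# Matches the raw byte 0x92 (the Windows-1252 curly apostrophe).
BAD_APOSTROPHE = re.compile(rb"\x92")


def fix_tree(root: Path, pattern: str = "*.html") -> list[Path]:
    """Replace 0x92 with ' in every matching file under root.

    Returns the list of files that were actually changed.
    """
    changed = []
    for path in sorted(root.rglob(pattern)):
        original = path.read_bytes()
        try:
            original.decode("utf-8")
            continue  # valid UTF-8 already; 0x92 may be part of a character
        except UnicodeDecodeError:
            pass
        fixed = BAD_APOSTROPHE.sub(b"'", original)
        if fixed != original:
            path.write_bytes(fixed)
            changed.append(path)
    return changed
```

Calling fix_tree(Path("C:/inetpub/wwwroot")), with the path replaced by the real site root, would process every .html file below it. Run it on a copy of the site first, since files are overwritten in place.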

Good Luck

DJ Quimby answered Sep 18 '22 00:09