I have a collection of .html files created in the mid-90s, which include a significant ammount of Korean text. The HTML lacks character set metadata, so of course all of the Korean text now does not render properly. The following examples will all make use of the same excerpt of text .
╙╦ ╝№бя└К ▓щ╥НВь╕цль▒Ф ▓щ╥НВь╕цль▒Ф
Which in the absence of character set metadata in < head > is rendered by the browser as:
ÓË ¼ü¡ïÀŠ ²éÒ‚ì¸æ«ì±” ²éÒ‚ì¸æ«ì±”
<meta http-equiv="Content-Type" content="text/html; charset=euc-kr">
Yields the following, which is illegible nonsense (verified by a native speaker):
沓 숩∽핅 꿴�귥멩レ콛 꿴�귥멩レ콛
I have tried this approach with all historic Korean character sets, each yielding similarly unsuccessful results. I also tried parsing and upgrading to UTF-8, via Beautiful Soup, which also failed.
Viewing the files in Emacs seems promising, as it reveals the text encoding a lower level. The following is the same sample of text:
\323\313 \274\374\241\357\300\212 \262\351\322\215\202\354\270\346\253\354\261\224 \262\3\ 51\322\215\202\354\270\346\253\354\261\224
How can I identify this text encoding and promote it to UTF-8?
All of those octal codes that emacs revealed are less than 254 (or \376 in octal), so it looks like one of those old pre-Unicode fonts that just used it's own mapping in the ASCII range. If this is right, you'll just have to try to figure out what font it was intended for, find it and perhaps do the conversion yourself.
It's a pain. Many years ago I did something similar for some popular pre-Unicode Greek fonts: http://litot.es/unicode-converter/ (the code: https://github.com/seanredmond/Encoding-Converter)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With