
Decoding Korean text files from the 90s

I have a collection of .html files created in the mid-90s that include a significant amount of Korean text. The HTML lacks character set metadata, so naturally none of the Korean text renders properly anymore. The following examples all use the same excerpt of text.

In text editors such as Coda and Text Wrangler the text displays as

╙╦ ╝№бя└К ▓щ╥НВь╕цль▒Ф ▓щ╥НВь╕цль▒Ф

which, in the absence of character set metadata in `<head>`, is rendered by the browser as:

ÓË ¼ü¡ïÀŠ ²éÒ‚ì¸æ«ì±” ²éÒ‚ì¸æ«ì±”


Adding euc-kr metadata to `<head>`

<meta http-equiv="Content-Type" content="text/html; charset=euc-kr">

Yields the following, which is illegible nonsense (verified by a native speaker):

沓 숩∽핅 꿴�귥멩レ콛 꿴�귥멩レ콛


I have tried this approach with all of the historic Korean character sets, each yielding similarly unsuccessful results. I also tried parsing and converting to UTF-8 via Beautiful Soup, which failed as well.
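The brute-force approach described above can be sketched in Python: read the raw bytes and try each Korean codec the standard library supports. The sample bytes below are transcribed from the octal dump shown later in the question; in practice you would read them from one of the .html files.

```python
# Try decoding a raw byte sample with each historic Korean codec that
# ships with Python. Failures are recorded as None rather than raised.
candidates = ["euc-kr", "cp949", "iso2022_kr", "johab"]

# First few bytes of the sample text, transcribed from the octal dump:
# \323\313 \274\374\241\357\300\212
raw = bytes([0o323, 0o313, 0o040, 0o274, 0o374, 0o241, 0o357, 0o300, 0o212])

results = {}
for codec in candidates:
    try:
        results[codec] = raw.decode(codec)
    except UnicodeDecodeError:
        results[codec] = None
    print(codec, "->", results[codec])
```

If every candidate either fails or produces nonsense (as it did here), the bytes were probably never meant to be interpreted by any standard codec at all, which is where the accepted answer picks up.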

Viewing the files in Emacs seems promising, as it reveals the text encoding at a lower level. The following is the same sample of text:

\323\313 \274\374\241\357\300\212 \262\351\322\215\202\354\270\346\253\354\261\224 \262\351\322\215\202\354\270\346\253\354\261\224


How can I identify this text encoding and promote it to UTF-8?

asked Jun 17 '12 by dongle

1 Answer

All of the octal codes that Emacs revealed are less than 254 (\376 in octal), so it looks like one of those old pre-Unicode fonts that just used its own mapping in the ASCII range. If this is right, you'll have to figure out which font the text was intended for, find it, and perhaps do the conversion yourself.
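Once the font is identified, the conversion amounts to a byte-to-character lookup table. The sketch below shows the shape of such a converter; the two mapping entries are invented placeholders, not the real glyph mapping for this file.

```python
# Hand-built conversion for a hypothetical pre-Unicode font: each byte
# that the font repurposed is mapped to the Unicode character its glyph
# actually depicted. The entries below are placeholders for illustration.
FONT_MAP = {
    0xD3: "한",  # hypothetical: byte 0xD3 drew this glyph in the old font
    0xCB: "글",  # hypothetical
}

def convert(raw: bytes) -> str:
    out = []
    for b in raw:
        if b in FONT_MAP:
            out.append(FONT_MAP[b])
        elif b < 0x80:
            out.append(chr(b))      # plain ASCII passes through unchanged
        else:
            out.append("\ufffd")    # unmapped high byte: replacement char
    return "".join(out)

print(convert(bytes([0xD3, 0xCB, 0x20, 0x41])))  # → 한글 A
```

Building `FONT_MAP` is the tedious part: you render each byte in the original font, note which character the glyph represents, and record the pair.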

It's a pain. Many years ago I did something similar for some popular pre-Unicode Greek fonts: http://litot.es/unicode-converter/ (the code: https://github.com/seanredmond/Encoding-Converter)

answered Sep 21 '22 by Sean Redmond