
Decoding Korean text files from the 90s

I have a collection of .html files created in the mid-90s that include a significant amount of Korean text. The HTML lacks character set metadata, so naturally none of the Korean text renders properly anymore. The following examples all use the same excerpt of text.

In text editors such as Coda and Text Wrangler the text displays as

╙╦ ╝№бя└К ▓щ╥НВь╕цль▒Ф ▓щ╥НВь╕цль▒Ф

which, in the absence of character set metadata in `<head>`, is rendered by the browser as:

ÓË ¼ü¡ïÀŠ ²éÒ‚ì¸æ«ì±” ²éÒ‚ì¸æ«ì±”


Adding euc-kr metadata to `<head>`

<meta http-equiv="Content-Type" content="text/html; charset=euc-kr">

Yields the following, which is illegible nonsense (verified by a native speaker):

沓 숩∽핅 꿴�귥멩レ콛 꿴�귥멩レ콛


I have tried this approach with all of the historic Korean character sets, each yielding similarly unsuccessful results. I also tried parsing and converting to UTF-8 via Beautiful Soup, which failed as well.
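The brute-force approach described above can be sketched in Python: read the raw bytes and try each Korean codec the standard library supports. The sample bytes below are transcribed from the octal dump shown later in the question; in practice you would read them from one of the .html files.

```python
# Try decoding a raw byte sample with each historic Korean codec that
# ships with Python. Failures are recorded as None rather than raised.
candidates = ["euc-kr", "cp949", "iso2022_kr", "johab"]

# First few bytes of the sample text, transcribed from the octal dump:
# \323\313 \274\374\241\357\300\212
raw = bytes([0o323, 0o313, 0o040, 0o274, 0o374, 0o241, 0o357, 0o300, 0o212])

results = {}
for codec in candidates:
    try:
        results[codec] = raw.decode(codec)
    except UnicodeDecodeError:
        results[codec] = None
    print(codec, "->", results[codec])
```

If every candidate either fails or produces nonsense (as it did here), the bytes were probably never meant to be interpreted by any standard codec at all, which is where the accepted answer picks up.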

Viewing the files in Emacs seems promising, as it reveals the text encoding at a lower level. The following is the same sample of text:

\323\313 \274\374\241\357\300\212 \262\351\322\215\202\354\270\346\253\354\261\224 \262\351\322\215\202\354\270\346\253\354\261\224


How can I identify this text encoding and promote it to UTF-8?

asked Jun 17 '12 by dongle

1 Answer

All of the octal codes that Emacs revealed are less than 254 (\376 in octal), so it looks like one of those old pre-Unicode fonts that just used its own mapping in the ASCII range. If this is right, you'll have to figure out which font the text was intended for, find it, and perhaps do the conversion yourself.
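Once the font is identified, the conversion amounts to a byte-to-character lookup table. The sketch below shows the shape of such a converter; the two mapping entries are invented placeholders, not the real glyph mapping for this file.

```python
# Hand-built conversion for a hypothetical pre-Unicode font: each byte
# that the font repurposed is mapped to the Unicode character its glyph
# actually depicted. The entries below are placeholders for illustration.
FONT_MAP = {
    0xD3: "한",  # hypothetical: byte 0xD3 drew this glyph in the old font
    0xCB: "글",  # hypothetical
}

def convert(raw: bytes) -> str:
    out = []
    for b in raw:
        if b in FONT_MAP:
            out.append(FONT_MAP[b])
        elif b < 0x80:
            out.append(chr(b))      # plain ASCII passes through unchanged
        else:
            out.append("\ufffd")    # unmapped high byte: replacement char
    return "".join(out)

print(convert(bytes([0xD3, 0xCB, 0x20, 0x41])))  # → 한글 A
```

Building `FONT_MAP` is the tedious part: you render each byte in the original font, note which character the glyph represents, and record the pair.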

It's a pain. Many years ago I did something similar for some popular pre-Unicode Greek fonts: http://litot.es/unicode-converter/ (the code: https://github.com/seanredmond/Encoding-Converter)

answered Sep 21 '22 by Sean Redmond