Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to convert a CP949 RTF to a UTF-8 encoded RTF?

I wanna write a python script that converts file encoding from cp949 to utf8. The file is orginally encoded in cp949. My script is as follows:

cpstr = open('terms.rtf').read()  
utfstr = cpstr.decode('cp949').encode('utf-8')  
tmp  = open('terms_utf.rtf', 'w')  
tmp.write(utfstr)  
tmp.close()

But this doesn't change the encoding as I intended.

like image 539
Arena Son Avatar asked Dec 24 '13 02:12

Arena Son


People also ask

How do I change my encoding to UTF-8?

UTF-8 Encoding in Notepad (Windows)Click File in the top-left corner of your screen. In the dialog which appears, select the following options: In the "Save as type" drop-down, select All Files. In the "Encoding" drop-down, select UTF-8.

How do I convert a file to UTF-8?

Click Tools, then select Web options. Go to the Encoding tab. In the dropdown for Save this document as: choose Unicode (UTF-8). Click Ok.

Does RTF support UTF-8?

Looks like RTF doesn't know UTF-8 at all, only Unicode in general.


1 Answers

There are three kinds of RTF, and I have no idea which kind you have. You can tell by opening the file in a plain-text editor, or just using less/more/cat/type/whatever to print it out to your terminal.


First, the easy cases: plaintext RTF.

A plaintext RTF file starts of with {\rtf, and all of the text within it is (as you'd expect) plain text—although sometimes runs of text will be broken up into separate runs with formatting commands—which start with \—in between them. Since all of the formatting commands are pure ASCII, if you convert a plaintext RTF from one charset to another (as long as both are supersets of ASCII, as cp949 and utf-8 both are), it should work fine.

However, the file may also have a formatting command that specifies what character set it's written in. This command looks like \ansicpg949. When an RTF editor like Wordpad opens your file, it will interpret all your nice UTF-8 data as cp949 data and mojibake the hell out of it unless you fix it.

The simplest way to fix it is to figure out what charset your editor wants to put there for UTF-8 files. Maybe it's \ansicpg65001, maybe it's \utf8, maybe it's something completely different. So just save a simple file as a UTF-8 RTF, then look at it in plain text, and see what it has in place of \ansicpg949, and replace the string in your file with the right one. (Note that code page 65001 is not really UTF-8, but it's close, and a lot of Microsoft code assumes they're the same…)

Also, some RTF editors (like Apple's TextEdit) will escape any non-ASCII characters (so, e.g., a é is stored as \'e9), so there's nothing to convert.

Finally, Office Open XML includes an XML spec for something that's called RTF, but isn't really the same thing. I believe many RTF editors can handle this. Fortunately, you can treat this the same way as plaintext RTF—all of the XML tags have pure-ASCII names.


The almost-as-easy case is compressed plaintext RTF. This is the same thing, but compressed with, I believe, zlib. Or it can actually be RTFD (which can be plaintext RTF together with a images and other things in separate files, or actual plain text with formatting runs stored in a separate file) in a .zip archive. Anyway, if you have one of these, the file command on most Unix systems should be able to detect it as "compressed RTF", at which point we can figure out what the specific format is and decompress it, and then you can edit it as plaintext RTF (or RTFD).

Needless to say, if you don't uncompress this first, you won't see any of your familiar text in the file—and you could easily end up breaking it so it can't be decompressed, or decompresses to garbage, by changing arbitrary bytes to different bytes.


Finally, the hard case: binary RTF.

The earliest versions of these were in an undocumented format, although they've been reverse-engineered. The later versions are public specs. Wikipedia has links to the specs. If you want to parse it manually you can, but it's going to be a substantial amount of code, and you're going to have to write it yourself.

A better solution would be to use one of the many libraries on PyPI that can convert RTF (including binary RTF) to other formats, which you can then edit easily.

like image 57
abarnert Avatar answered Oct 12 '22 06:10

abarnert