How to convert a CP949 RTF to a UTF-8 encoded RTF?

Tags: python, encoding, utf-8, rtf

I want to write a Python script that converts a file's encoding from cp949 to UTF-8. The file is originally encoded in cp949. My script is as follows:

cpstr = open('terms.rtf').read()
utfstr = cpstr.decode('cp949').encode('utf-8')
tmp = open('terms_utf.rtf', 'w')
tmp.write(utfstr)
tmp.close()

But this doesn't change the encoding as I intended.

asked Dec 24 '13 02:12

Arena Son

1 Answer

There are three kinds of RTF, and I have no idea which kind you have. You can tell by opening the file in a plain-text editor, or just using less/more/cat/type/whatever to print it out to your terminal.

First, the easy case: plaintext RTF.

A plaintext RTF file starts off with {\rtf, and all of the text within it is (as you'd expect) plain text, although sometimes a run of text will be broken up into separate runs with formatting commands (which start with \) in between them. Since all of the formatting commands are pure ASCII, if you convert a plaintext RTF from one charset to another (as long as both are supersets of ASCII, as cp949 and utf-8 both are), it should work fine.
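As a rough sketch of that byte-level conversion, assuming Python 3 and a plaintext RTF: the function names here are made up for illustration, and the file names in the usage note are the ones from the question. It reads and writes in binary mode so nothing gets decoded or newline-translated behind your back.

```python
def convert_rtf_bytes(data: bytes, src: str = "cp949", dst: str = "utf-8") -> bytes:
    """Re-encode raw file bytes from one charset to another."""
    # The ASCII formatting commands come through unchanged because both
    # charsets are supersets of ASCII; only the non-ASCII text changes.
    return data.decode(src).encode(dst)


def convert_rtf_file(src_path: str, dst_path: str) -> None:
    """Read src_path as cp949 bytes and write dst_path as UTF-8 bytes."""
    # Binary mode avoids implicit decoding and newline translation.
    with open(src_path, "rb") as f:
        data = f.read()
    with open(dst_path, "wb") as f:
        f.write(convert_rtf_bytes(data))
```

Usage would be something like convert_rtf_file('terms.rtf', 'terms_utf.rtf'). This only re-encodes the bytes; it does not touch any charset command inside the file.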

However, the file may also have a formatting command that specifies what character set it's written in. This command looks like \ansicpg949. When an RTF editor like WordPad opens your converted file, it will still see that command, interpret all your nice UTF-8 data as cp949 data, and mojibake the hell out of it unless you fix it.

The simplest way to fix it is to figure out what command your editor puts there for UTF-8 files. Maybe it's \ansicpg65001, maybe it's \utf8, maybe it's something completely different. So just save a simple file as a UTF-8 RTF, look at it in plain text, see what it has in place of \ansicpg949, and replace the string in your file with the right one. (Note that code page 65001 is not really UTF-8, but it's close, and a lot of Microsoft code assumes they're the same…)
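Once you know what your editor writes, the replacement itself could look something like this. It's a sketch: replace_codepage is a hypothetical helper, and the replacement value is whatever you found in your own editor's output. The \ansicpg65001 in the usage note is only one of the guesses above, not a known-correct answer.

```python
import re


def replace_codepage(data: bytes, replacement: bytes,
                     old: bytes = b"\\ansicpg949") -> bytes:
    """Swap the charset command in raw RTF bytes for `replacement`."""
    # The negative lookahead keeps a longer number such as \ansicpg9490
    # from being matched by accident.
    pattern = re.compile(re.escape(old) + rb"(?![0-9])")
    # A function is used as the replacement so the backslashes in
    # `replacement` are inserted literally, not treated as regex escapes.
    return pattern.sub(lambda m: replacement, data)
```

For example, replace_codepage(data, b"\\ansicpg65001") would rewrite the command in the bytes you just converted, assuming that is what your editor turned out to use.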

Also, some RTF editors (like Apple's TextEdit) will escape any non-ASCII characters (so, e.g., a é is stored as \'e9), so there's nothing to convert.
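A quick way to check whether you're in that situation is to look for raw non-ASCII bytes and for \'hh escapes. This is a sketch with made-up helper names; the regex is naive and doesn't try to handle an escaped backslash right before the quote.

```python
import re

# Matches escapes such as \'e9 (backslash, apostrophe, two hex digits).
HEX_ESCAPE = re.compile(rb"\\'[0-9a-fA-F]{2}")


def has_raw_non_ascii(data: bytes) -> bool:
    """True if the file contains bytes above 0x7F, i.e. unescaped text."""
    return any(b > 0x7F for b in data)


def count_hex_escapes(data: bytes) -> int:
    """Count \\'hh escapes in the raw bytes."""
    return len(HEX_ESCAPE.findall(data))
```

If has_raw_non_ascii returns False, the file is pure ASCII and re-encoding it from cp949 to UTF-8 won't change a single byte.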

Finally, Office Open XML includes an XML spec for something that's called RTF, but isn't really the same thing. I believe many RTF editors can handle this. Fortunately, you can treat this the same way as plaintext RTF—all of the XML tags have pure-ASCII names.

The almost-as-easy case is compressed plaintext RTF. This is the same thing, but compressed with, I believe, zlib. Or it can actually be RTFD (which can be plaintext RTF together with images and other things in separate files, or actual plain text with formatting runs stored in a separate file) in a .zip archive. Anyway, if you have one of these, the file command on most Unix systems should be able to detect it as "compressed RTF", at which point we can figure out what the specific format is and decompress it, and then you can edit it as plaintext RTF (or RTFD).

Needless to say, if you don't uncompress this first, you won't see any of your familiar text in the file—and you could easily end up breaking it so it can't be decompressed, or decompresses to garbage, by changing arbitrary bytes to different bytes.
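If you'd rather check from Python than eyeball the file, a very rough sniff of the first few bytes can separate the cases so far. This is a sketch: sniff_rtf_kind is a hypothetical helper, it only recognizes the {\rtf prefix and the standard ZIP local-file signature, and anything else comes back as "unknown" instead of being guessed at.

```python
def sniff_rtf_kind(data: bytes) -> str:
    """Classify raw file bytes as 'plaintext', 'zip', or 'unknown'."""
    if data.startswith(b"{\\rtf"):
        # Plaintext RTF: safe to re-encode as bytes.
        return "plaintext"
    if data.startswith(b"PK\x03\x04"):
        # ZIP archive: unpack it before touching the contents.
        return "zip"
    # Compressed or binary formats need the `file` command or a library.
    return "unknown"
```

Only the "plaintext" result means the byte-level conversion is safe to run directly.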

Finally, the hard case: binary RTF.

The earliest versions of these were in an undocumented format, although they've been reverse-engineered. The later versions are public specs. Wikipedia has links to the specs. If you want to parse it manually you can, but it's going to be a substantial amount of code, and you're going to have to write it yourself.

A better solution would be to use one of the many libraries on PyPI that can convert RTF (including binary RTF) to other formats, which you can then edit easily.

answered Oct 12 '22 06:10

abarnert
