Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to discover what codepage to use when converting RTF hex literals to Unicode

I'm parsing RTF 1.5+ files generated by Word 2003+ that may have content from other languages. This content is usually encoded as hex literals (\'xx). I would like to convert these literals to unicode values.

I know my document's code page by looking for ansicpg (\ansi\ansicpg1252).

When I use the ansicpg codepage to decode to Unicode, many languages (like French) seem to convert to the Unicode char values that I expect.

However when I see Russian text (like below), codepage 1252 decodes the content to jibberish.

\f277\lang1049\langfe1033\langnp1049\insrsid5989826\charrsid6817286 \'d1\'f2\'f0\'e0\'ed\'e8\'f6\'fb \'e1\'e5\'e7 \'ed\'e0\'e7\'e2\'e0\'ed\'e8\'ff. \'dd\'f2 \'e0 \'f1\'f2\'f0\'e0\'ed\'e8\'f6\'e0 \'ed\'e5 \'e4\'ee\'eb\'e6\'ed\'e0 \'ee\'f2\'ee\'e1\'f0\'e0\'e6\'e0\'f2\'fc\'f1\'ff \'e2 \'f2\'e0\'e1\'eb\'e8\'f6\'e5 \'e2 \'f1\'ee\'e4\'e5\'f0\'e6\'e0\'ed\'e8\'e8.

I assume that lang1049, langfe1033, langnp1049 should provide me clues so I can programmatically choose a different (non-default) code page for the text that they reference? If so, where can I find information that explains how to map a lang* code to a codepage? Or should I be looking for some other RTF command/directive to provide me with the information I'm looking for? (Or must I use \f277 as a font reference and see if it has an associated codepage?)

like image 207
Malcolm Avatar asked Oct 25 '22 06:10

Malcolm


1 Answers

\lang really only marks up particular stretches of the text as being in a particular language, and shouldn't impact what code page is to be used for the old non-Unicode \' escapes.

Putting an \ansicpg token in the header should perhaps do it, but seems to be ignored by Word (for both raw bytes and \' escapes.

Or must I use \f277 as a font reference and see if it has an associated codepage?

It looks that way. Changing the \fcharset of the font assigned to a particular stretch of text is the only way I can get Word to change how it treats the bytes, anyway. The codes in this token (see eg here for list) are, aggravatingly, different again from either the language ID or the code page number.

like image 200
bobince Avatar answered Dec 07 '22 16:12

bobince