
Detect encoding of RTF document in Java

Tags:

java

rtf

My Java program extracts text from RTF files using RTFEditorKit. Some of the RTF files contain Cyrillic (Russian) characters, and depending on the RTF version, the extracted text is either fine or gibberish. When it's gibberish, I can recover the normal text like this:

String text = ... // extracted text

String decodedText = new String(text.getBytes("ISO-8859-1"), "cp1251");

Now the problem is that I couldn't find a way to automatically detect the encoding of the file, i.e. whether the extracted text must be decoded or not. Does anybody know how to do this? Thanks in advance!

EDIT: In the first lines of the RTF files I see something that looks like an encoding:

  • Files where I get gibberish: {\rtf1\ansi\ansicpg1251\deff0\deflang1049
  • Files with okay text: {\rtf1\ansi\ansicpg1251\deff0
python dude asked Oct 18 '25 11:10


1 Answer

I don't believe the file itself has an encoding. From the Wikipedia page:

RTF is an 8-bit format. That would limit it to ASCII, but RTF can encode characters beyond ASCII by escape sequences. The character escapes are of two types: code page escapes and Unicode escapes. In a code page escape, two hexadecimal digits following an apostrophe are used for denoting a character taken from a Windows code page. For example, if control codes specifying Windows-1256 are present, the sequence \'c8 will encode the Arabic letter beh (ب).

If a Unicode escape is required, the control word \u is used, followed by a 16-bit signed decimal integer giving the Unicode codepoint number.

So I suspect you'll have to extract the raw text yourself and then decode the escapes using the rules above.
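As a rough illustration of those rules, here is a minimal Java sketch. It reads the code page from the \ansicpg control word (the one visible in the headers quoted in the question) and decodes \'hh and \uN escapes. The class RtfEscapeSketch and its methods are hypothetical names, not part of RTFEditorKit or any library. It assumes the code page number maps to a "windows-N" charset, that each \uN is followed by at most one fallback character (the RTF default), and it leaves all other control words untouched.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not a full RTF parser. */
final class RtfEscapeSketch {

    // \ansicpgN in the header names the Windows code page used by \'hh escapes.
    private static final Pattern ANSICPG = Pattern.compile("\\\\ansicpg(\\d+)");

    // Group 1: a run of \'hh code page escapes.
    // Group 2: the number of a \uN escape, followed by an optional delimiter
    // space and (assumption) at most one fallback character or \'hh escape.
    private static final Pattern ESCAPE = Pattern.compile(
            "((?:\\\\'[0-9a-fA-F]{2})+)"
            + "|\\\\u(-?\\d+) ?(?:\\\\'[0-9a-fA-F]{2}|[^\\\\{}])?");

    /** Returns the charset named by \ansicpgN, or windows-1252 if absent (assumption). */
    static Charset detectCodePage(String rtf) {
        Matcher m = ANSICPG.matcher(rtf);
        String name = m.find() ? "windows-" + m.group(1) : "windows-1252";
        // Throws UnsupportedCharsetException for code pages the JVM does not know.
        return Charset.forName(name);
    }

    /** Replaces \'hh and \uN escapes with the characters they denote. */
    static String decodeEscapes(String rtf, Charset codePage) {
        Matcher m = ESCAPE.matcher(rtf);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(rtf, last, m.start());
            if (m.group(1) != null) {
                // Each \'hh is four characters; collect the bytes of the whole
                // run so multi-byte code pages are decoded together.
                String run = m.group(1);
                byte[] bytes = new byte[run.length() / 4];
                for (int i = 0; i < bytes.length; i++) {
                    String hex = run.substring(i * 4 + 2, i * 4 + 4);
                    bytes[i] = (byte) Integer.parseInt(hex, 16);
                }
                out.append(new String(bytes, codePage));
            } else {
                // \uN is a signed 16-bit decimal; negative values wrap around.
                int unit = Integer.parseInt(m.group(2));
                if (unit < 0) {
                    unit += 65536;
                }
                out.append((char) unit);
            }
            last = m.end();
        }
        out.append(rtf, last, rtf.length());
        return out.toString();
    }
}
```

A caller would pass the raw file contents to detectCodePage and then run decodeEscapes on the same string with the returned charset. With this approach the re-decoding trick from the question (ISO-8859-1 bytes reinterpreted as cp1251) should no longer be needed, because the \'hh bytes are interpreted in the declared code page from the start.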

Brian Agnew answered Oct 21 '25 00:10


