Detect encoding of RTF document in Java

Question

My Java program extracts text from RTF files using RTFEditorKit. Some of the files contain Cyrillic (Russian) characters, and depending on the RTF version, the extracted text is either fine or gibberish. When it is gibberish, I can recover the correct text like this:

String text = ... // extracted text

String decodedText = new String(text.getBytes("ISO-8859-1"), "cp1251");

The problem is that I can't find a way to detect the file's encoding automatically, i.e. to tell whether the extracted text needs this re-decoding or not. Does anybody know how to do this? Thanks in advance!

EDIT: In the first lines of the RTF files I see something that looks like an encoding:

Files where I get gibberish: {\rtf1\ansi\ansicpg1251\deff0\deflang1049
Files with okay text: {\rtf1\ansi\ansicpg1251\deff0
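For illustration, here is a minimal sketch of reading the declared code page out of a header like the ones above. The class and method names (`RtfHeaderSniffer`, `declaredCodePage`) are made up for this example, and it assumes the first few hundred characters of the file have already been read into a string. Since both headers shown declare `\ansicpg1251`, this value on its own probably does not separate the two kinds of files; it only says which code page any `\'hh` escapes should be decoded with.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of RTFEditorKit or any library.
final class RtfHeaderSniffer {

    // Matches the control word "\ansicpg" followed by a decimal code page number.
    private static final Pattern ANSICPG = Pattern.compile("\\\\ansicpg(\\d+)");

    /** Returns the code page declared in the header, if there is one. */
    static OptionalInt declaredCodePage(String rtfHead) {
        Matcher m = ANSICPG.matcher(rtfHead);
        return m.find()
                ? OptionalInt.of(Integer.parseInt(m.group(1)))
                : OptionalInt.empty();
    }
}
```

A caller could then try `Charset.forName("windows-" + codePage)` after checking `Charset.isSupported`, which is again an assumption about how the number maps to a Java charset name.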

Brian Agnew · Accepted Answer

I don't believe the file itself has an encoding. From the Wikipedia page:

RTF is an 8-bit format. That would limit it to ASCII, but RTF can encode characters beyond ASCII by escape sequences. The character escapes are of two types: code page escapes and Unicode escapes. In a code page escape, two hexadecimal digits following an apostrophe are used for denoting a character taken from a Windows code page. For example, if control codes specifying Windows-1256 are present, the sequence \'c8 will encode the Arabic letter beh (ب).

If a Unicode escape is required, the control word \u is used, followed by a 16-bit signed decimal integer giving the Unicode codepoint number.

So I suspect you'll have to extract the text yourself and then decode it using the rules above.
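The two escape rules quoted above could be sketched roughly as follows. This is not a full RTF parser: the class name `RtfEscapeDecoder` is invented for the example, and it assumes a single-byte code page (such as windows-1251), exactly one fallback character after each Unicode escape (the RTF default), and input that is a fragment of body text without other control words or groups.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: decodes only the two character escape types.
final class RtfEscapeDecoder {

    // Code page escape: backslash, apostrophe, two hex digits.
    private static final Pattern HEX = Pattern.compile("\\\\'([0-9a-fA-F]{2})");
    // Unicode escape: backslash, 'u', signed decimal number, optional space.
    private static final Pattern UNI = Pattern.compile("\\\\u(-?\\d+) ?");

    static String decode(String rtf, Charset codePage) {
        StringBuilder out = new StringBuilder();
        Matcher hex = HEX.matcher(rtf);
        Matcher uni = UNI.matcher(rtf);
        int i = 0;
        while (i < rtf.length()) {
            if (hex.region(i, rtf.length()).lookingAt()) {
                // Interpret the single byte in the given code page.
                byte b = (byte) Integer.parseInt(hex.group(1), 16);
                out.append(new String(new byte[] {b}, codePage));
                i = hex.end();
            } else if (uni.region(i, rtf.length()).lookingAt()) {
                // The value is a signed 16-bit number; the cast to char
                // keeps the low 16 bits, so negative values wrap correctly.
                out.append((char) Integer.parseInt(uni.group(1)));
                i = uni.end();
                // Skip one fallback character (assumed skip count of 1),
                // which may itself be written as a code page escape.
                if (hex.region(i, rtf.length()).lookingAt()) {
                    i = hex.end();
                } else if (i < rtf.length()) {
                    i++;
                }
            } else {
                out.append(rtf.charAt(i++));
            }
        }
        return out.toString();
    }
}
```

For example, `RtfEscapeDecoder.decode("\\'cf\\'f0\\'e8\\'e2\\'e5\\'f2", Charset.forName("windows-1251"))` should give "Привет", and the same word written as Unicode escapes (`\u1055?\u1088?...`) decodes without needing the code page at all. That difference may be why some files survive extraction and others do not, though that is a guess rather than something confirmed in the question.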
