My Java program extracts text from RTF files using RTFEditorKit. Some of the RTF files contain Cyrillic (Russian) characters, and depending on the RTF version, the extracted text is either fine or gibberish. When it's gibberish, I can recover the normal text like this:
String text = ... // extracted text
String decodedText = new String(text.getBytes("ISO-8859-1"), "cp1251");
The problem is that I couldn't find a way to automatically detect the encoding of the file, i.e. to tell whether the extracted text needs to be decoded or not. Does anybody know how to do this? Thanks in advance!
EDIT: In the first lines of the RTF files I see something that looks like an encoding:
I don't believe the file itself has an encoding. From the Wikipedia page:
RTF is an 8-bit format. That would limit it to ASCII, but RTF can encode characters beyond ASCII by escape sequences. The character escapes are of two types: code page escapes and Unicode escapes. In a code page escape, two hexadecimal digits following an apostrophe are used for denoting a character taken from a Windows code page. For example, if control codes specifying Windows-1256 are present, the sequence \'c8 will encode the Arabic letter beh (ب).
If a Unicode escape is required, the control word \u is used, followed by a 16-bit signed decimal integer giving the Unicode codepoint number.
so I suspect you'll have to extract the text yourself and then decode the escape sequences using the rules above.
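To make the two escape rules concrete, here is a minimal sketch of how they could be applied by hand. It is not what RTFEditorKit does internally, and the class and method names (`RtfEscapeDecoder`, `detectCharset`, `decode`) are made up for illustration. It assumes the code page is declared in the header with the RTF `\ansicpgN` control word, that the code page is single-byte (as cp1251 is), that Java knows the charset under the alias `cp` + N, and that every Unicode escape is followed by exactly one fallback character (the RTF default). It only replaces the escapes and leaves the rest of the RTF markup untouched.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class RtfEscapeDecoder {
    // Header control word that declares the ANSI code page, e.g. ansicpg1251.
    private static final Pattern ANSICPG = Pattern.compile("\\\\ansicpg(\\d+)");

    // Group 1: the two hex digits of a code page escape (backslash, apostrophe, hh).
    // Group 2: the signed decimal number of a Unicode escape, followed by an
    // optional delimiter space and one fallback character, which is skipped.
    private static final Pattern ESCAPE = Pattern.compile(
            "\\\\'([0-9a-fA-F]{2})|\\\\u(-?\\d+) ?(?:\\\\'[0-9a-fA-F]{2}|[^\\\\{}])?");

    // Assumption: falls back to windows-1252 when no code page is declared.
    static Charset detectCharset(String rtf) {
        Matcher m = ANSICPG.matcher(rtf);
        return m.find() ? Charset.forName("cp" + m.group(1))
                        : Charset.forName("windows-1252");
    }

    static String decode(String rtf) {
        Charset codePage = detectCharset(rtf);
        Matcher m = ESCAPE.matcher(rtf);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(rtf, last, m.start());
            if (m.group(1) != null) {
                // Code page escape: one byte, decoded with the declared code page.
                byte b = (byte) Integer.parseInt(m.group(1), 16);
                out.append(new String(new byte[] {b}, codePage));
            } else {
                // Unicode escape: a signed 16-bit value, so negative numbers
                // wrap around to the upper half of the 16-bit range.
                int code = Integer.parseInt(m.group(2));
                if (code < 0) {
                    code += 65536;
                }
                out.append((char) code);
            }
            last = m.end();
        }
        out.append(rtf, last, rtf.length());
        return out.toString();
    }
}
```

You could call `RtfEscapeDecoder.detectCharset(rawRtf)` on the raw file contents (read as ISO-8859-1 or ASCII) to decide which code page to pass to your existing `new String(text.getBytes("ISO-8859-1"), ...)` workaround, or call `decode(rawRtf)` before stripping the remaining markup. If no `\ansicpg` control word is present, this sketch cannot tell you the encoding and simply guesses windows-1252.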