
Parsing RTF Documents with Java/JavaCC

Is anybody familiar with the RTF document format and with parsing it using any Java libraries? The standard way people have done this is by using the RTFEditorKit in the JDK Swing API:

Swing RTFEditorKit API

but it isn't that accurate when it comes to parsing RTF documents. In fact there's a comment in the API:

The RTF support was not written by the Swing team. In the future we hope to improve the support provided.

I don't think I'm going to wait for this to happen :)
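For reference, the RTFEditorKit approach mentioned above looks roughly like the sketch below. The class name `RtfEditorKitExample` and the helper `extractText` are made up for illustration; `RTFEditorKit.createDefaultDocument`, `RTFEditorKit.read` and `Document.getText` are the standard Swing calls. It only pulls out plain text and inherits whatever parsing inaccuracies the kit has.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.rtf.RTFEditorKit;

// Hypothetical helper class, not part of the JDK.
class RtfEditorKitExample {

    // Parses an RTF stream with the Swing kit and returns the document's plain text.
    static String extractText(InputStream in) throws IOException, BadLocationException {
        RTFEditorKit kit = new RTFEditorKit();
        Document doc = kit.createDefaultDocument();
        kit.read(in, doc, 0); // insert the parsed content at position 0
        return doc.getText(0, doc.getLength());
    }

    public static void main(String[] args) throws Exception {
        // A tiny hand-written RTF document used as sample input.
        byte[] rtf = "{\\rtf1\\ansi Hello, RTF}".getBytes(StandardCharsets.US_ASCII);
        System.out.println(extractText(new ByteArrayInputStream(rtf)));
    }
}
```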

The other approach taken is to define a grammar using JavaCC and generate a parser. This works better, but I'm having trouble finding a complete grammar. I've tried:

PMD Applied JavaCC Grammar

which is ok and the following (which is the best so far).

Koders RTFParserDelegate and ETranslate Grammar

There are various implementations of the ETranslate grammar about (I know the Nutch API may use this). Does anybody know which is the most accurate grammar or whether there is a better approach to this?

I could start ploughing through the JavaCC docs to understand the .jj files and test it against the RTF files... this is my current approach, but it's taking a while... any help would be appreciated

Jon asked May 12 '09 18:05


1 Answer

Does anybody know which is the most accurate grammar or whether there is a better approach to this?

Many years ago I spent some time reading RTF (Wikipedia) with C#. I say "reading" deliberately: if you understand RTF in detail and use it the way it was designed, you will realize that it is not meant to be read and parsed as a whole, over and over again, while editing. The documentation does give the syntax for RTF, but don't be misled into believing that you should use a lexer/parser. The documentation also includes a sample reader for RTF.
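To give a feel for what a hand-written reader (as opposed to a generated lexer/parser) can look like, here is a minimal Java sketch. It is not the sample reader from the RTF documentation, and the class name `MiniRtfReader`, the method `toPlainText` and the `SKIPPED` list are all invented for this example. It walks the input one character at a time, tracks group nesting with a stack, and keeps only body text. It assumes well-formed input, handles just a handful of control words, and treats `\'hh` bytes as Latin-1, which a real reader would resolve through the document's code page.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;

// Hypothetical minimal reader: a sketch, far from a complete RTF implementation.
class MiniRtfReader {

    // Destinations whose contents are not body text (a small, incomplete list).
    private static final Set<String> SKIPPED =
            Set.of("fonttbl", "colortbl", "stylesheet", "info", "pict");

    static String toPlainText(String rtf) {
        StringBuilder out = new StringBuilder();
        Deque<Boolean> saved = new ArrayDeque<>(); // skip flag of each enclosing group
        boolean skip = false;                      // true while inside a skipped destination
        int i = 0, n = rtf.length();
        while (i < n) {
            char c = rtf.charAt(i++);
            if (c == '{') {
                saved.push(skip);                  // entering a group: remember current state
            } else if (c == '}') {
                if (!saved.isEmpty()) skip = saved.pop(); // leaving a group: restore state
            } else if (c == '\\' && i < n) {
                char next = rtf.charAt(i);
                if (Character.isLetter(next)) {
                    // Control word: letters, optional signed number, optional one-space delimiter.
                    int start = i;
                    while (i < n && Character.isLetter(rtf.charAt(i))) i++;
                    String word = rtf.substring(start, i);
                    if (i < n && rtf.charAt(i) == '-') i++;
                    while (i < n && Character.isDigit(rtf.charAt(i))) i++;
                    if (i < n && rtf.charAt(i) == ' ') i++;
                    if (SKIPPED.contains(word)) skip = true;
                    else if (!skip && (word.equals("par") || word.equals("line"))) out.append('\n');
                    else if (!skip && word.equals("tab")) out.append('\t');
                } else {
                    // Control symbol: a backslash followed by a single non-letter.
                    i++;
                    if (next == '*') {
                        skip = true;               // \* marks an ignorable destination
                    } else if (next == '\'' && i + 2 <= n) {
                        // \'hh is a hex-encoded byte (assumed Latin-1 here).
                        if (!skip) out.append((char) Integer.parseInt(rtf.substring(i, i + 2), 16));
                        i += 2;
                    } else if (!skip && (next == '\\' || next == '{' || next == '}')) {
                        out.append(next);          // escaped literal character
                    }
                }
            } else if (c != '\r' && c != '\n' && !skip) {
                out.append(c);                     // raw CR/LF in the file carry no meaning
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String rtf = "{\\rtf1\\ansi{\\fonttbl{\\f0 Arial;}}\\f0 Hello\\par caf\\'e9 \\{ok\\}}";
        System.out.println(toPlainText(rtf));
    }
}
```

Because the reader only ever needs the current character and the group stack, the same loop could be fed from a stream instead of a `String`, so the whole document never has to sit in memory at once.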

Remember that RTF was created many years ago, when memory was measured in KB rather than MB and editing documents of several hundred pages in a conventional way would tax system resources. So RTF can be edited in smaller subsections without loading or modifying the entire document. This is what lets it handle such large documents with limited memory. It is also why the syntax may seem odd at first.

Guy Coder answered Nov 07 '22 14:11