Java RTF Parser

Tags: java, parsing, rtf

Does anyone know of a robust RTF parser I can use in Java? I need to extract plain text, including international text. It would also be nice to extract embedded images and files. A C++ or other library that I can easily call from Java would also work, or, if there is good source code, I can port it to Java.

The following libraries do not cover enough of RTF, or fail to parse some valid RTF files:

  1. Java Swing's RTFEditorKit: quite basic and brittle. Apache Tika, Nutch, and lots of other tools use this.
  2. An RTF library from iText (com.lowagie.etc...): not too comprehensive.
  3. etranslate RTF library: the most complete of the Java ones. I am not sure whether there is an updated version, but the version I have fails on some of my RTF collection (the files are valid, or at least they open fine in MS Word and OpenOffice).
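
For reference, the Swing baseline from item 1 can be sketched as below. This is only a minimal illustration of plain-text extraction with the JDK's own `RTFEditorKit`; the class name `SwingRtfText`, the helper `extractText`, and the sample RTF string are made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.swing.text.Document;
import javax.swing.text.rtf.RTFEditorKit;

public class SwingRtfText {

    // Hypothetical helper: reads RTF from a stream and returns the document text.
    static String extractText(InputStream in) throws Exception {
        RTFEditorKit kit = new RTFEditorKit();
        Document doc = kit.createDefaultDocument(); // a styled document, as the kit requires
        kit.read(in, doc, 0);                       // parse the RTF into the document
        return doc.getText(0, doc.getLength());     // plain text, formatting dropped
    }

    public static void main(String[] args) throws Exception {
        // Small sample document, assumed for illustration only.
        String rtf = "{\\rtf1\\ansi{\\fonttbl\\f0\\fswiss Helvetica;}"
                   + "\\f0\\pard This is some {\\b bold} text.\\par}";
        InputStream in = new ByteArrayInputStream(rtf.getBytes(StandardCharsets.US_ASCII));
        System.out.println(extractText(in));
    }
}
```

This works for simple input like the sample, which is why many tools start here; the failures described above show up on more complex real-world files.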

There's a C# library that's reasonably complete, but alas, it's C# and not Java: http://www.codeproject.com/Articles/27431/Writing-Your-Own-RTF-Converter

I also looked into OpenOffice. It is probably very comprehensive, but it is too slow for what I need.

(I did do web searches and Stack Overflow searches before posting this question, so if you are referring me to an ancient "already asked" post, it probably doesn't have an answer there. But feel free to point it out, in case I missed it!)

Mary asked Jun 20 '13 21:06


2 Answers

You may find RTF Parser Kit useful. It provides a stream-based parser which delivers events to you as the document is parsed. There is a simple example text extractor provided which demonstrates how the API can be used.
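
To illustrate what "stream-based, event-delivering" means in practice, here is a rough, self-contained sketch of the general technique. It is not the RTF Parser Kit API: the `Listener` interface, the `TinyRtfEvents` class, and the list of skipped groups are all hypothetical, and the `\'hh` handling assumes the bytes map directly to Latin-1 code points, which real documents do not guarantee.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;

public class TinyRtfEvents {

    /** Hypothetical callback interface; NOT the RTF Parser Kit API. */
    interface Listener {
        void groupStart();
        void groupEnd();
        void controlWord(String name, Integer param);
        void text(String s);
    }

    private static boolean isAsciiLetter(int c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    private static void flush(StringBuilder text, Listener l) {
        if (text.length() > 0) { l.text(text.toString()); text.setLength(0); }
    }

    /** Reads RTF once, front to back, and reports events as they are seen. */
    static void parse(Reader in, Listener l) throws IOException {
        PushbackReader r = new PushbackReader(in, 1);
        StringBuilder text = new StringBuilder();
        int c;
        while ((c = r.read()) != -1) {
            if (c == '{' || c == '}') {
                flush(text, l);
                if (c == '{') l.groupStart(); else l.groupEnd();
            } else if (c == '\\') {
                int n = r.read();
                if (n == -1) break;
                if (isAsciiLetter(n)) {                       // control word, e.g. \par or \fs24
                    flush(text, l);
                    StringBuilder name = new StringBuilder().append((char) n);
                    while ((n = r.read()) != -1 && isAsciiLetter(n)) name.append((char) n);
                    StringBuilder num = new StringBuilder();
                    if (n == '-') { num.append('-'); n = r.read(); }
                    while (n >= '0' && n <= '9') { num.append((char) n); n = r.read(); }
                    if (n != -1 && n != ' ') r.unread(n);     // one space is a delimiter and is swallowed
                    boolean hasNum = num.length() > 0 && !num.toString().equals("-");
                    l.controlWord(name.toString(), hasNum ? Integer.valueOf(num.toString()) : null);
                } else if (n == '\'') {                       // \'hh hex escape (Latin-1 assumed)
                    int hi = r.read(), lo = r.read();
                    if (hi == -1 || lo == -1) break;
                    text.append((char) Integer.parseInt("" + (char) hi + (char) lo, 16));
                } else if (n == '\\' || n == '{' || n == '}') {
                    text.append((char) n);                    // escaped literal character
                }                                             // other control symbols are ignored here
            } else if (c != '\r' && c != '\n') {              // raw line breaks carry no meaning in RTF
                text.append((char) c);
            }
        }
        flush(text, l);
    }

    /** Example listener: collects body text and skips a few header groups. */
    static String extractText(String rtf) throws IOException {
        StringBuilder out = new StringBuilder();
        parse(new StringReader(rtf), new Listener() {
            int depth = 0;
            int skipDepth = -1;                               // depth of the group being skipped, or -1
            public void groupStart() { depth++; }
            public void groupEnd() { if (depth == skipDepth) skipDepth = -1; depth--; }
            public void controlWord(String name, Integer param) {
                if (skipDepth != -1) return;
                if (name.equals("fonttbl") || name.equals("colortbl")
                        || name.equals("stylesheet") || name.equals("info")) skipDepth = depth;
                else if (name.equals("par")) out.append('\n');
            }
            public void text(String s) { if (skipDepth == -1) out.append(s); }
        });
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(extractText(
            "{\\rtf1\\ansi{\\fonttbl{\\f0 Arial;}}\\f0 Caf\\'e9 {\\b bold}\\par}"));
    }
}
```

The appeal of this style is that the caller decides what to keep: a text extractor ignores most events, while an image extractor would listen for different groups. A real library additionally handles code pages, `\u` Unicode escapes, and binary data, all of which this sketch leaves out.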

Jon Iles answered Oct 10 '22 14:10


If your project is non-commercial, there is a good free Java RTF-to-XML library, better than etranslate in my opinion, and you can process the XML from there. If you are using it for commercial purposes, however, you will have to arrange licensing with rtf-to-xml.com, the company that developed it.

Having once been in a similar situation, before finding rtf-to-xml, I found a funny workaround for this problem when I needed to parse MS RTF on a Linux server. There is a free rich text processor called Ted, which is also usable as a library. It takes arguments from the command line without the user interface, and it can be wrapped in a JNI call.

I hope this helps.

J-Boss answered Oct 10 '22 14:10