
How do I read bold and italic formatting from a Word document using Apache POI?

I am using Apache POI.

I am able to read text from a doc file by using "org.apache.poi.hwpf.extractor.WordExtractor"

I have also fetched the tables by using "org.apache.poi.hwpf.usermodel.Table"

But how can I fetch the bold/italic formatting of the text? Please advise.

Thanks in advance.

Sudeep nayak asked Jun 05 '13 10:06


2 Answers

WordExtractor returns only the text, nothing else.

The simplest way for you to get the text and formatting of a Word document is to switch to Apache Tika. Apache Tika builds on top of Apache POI (amongst others), and offers both plain text extraction and rich extraction (XHTML with formatting).

Alternatively, if you want to write the code yourself, I'd suggest you review the code in Tika's WordExtractor, which demonstrates how to use Apache POI to get the formatting information out of runs of text.
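
As a rough illustration of the Tika route, here is a minimal sketch that parses a stream and returns Tika's XHTML output. It assumes tika-core and the Tika parser modules are on the classpath; the class name TikaXhtml and the method toXhtml are hypothetical names chosen for this example, not part of Tika.

```java
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.ToXMLContentHandler;

// Hypothetical helper class for this sketch; not part of Tika itself.
class TikaXhtml {

    // Parses the stream with Tika's auto-detecting parser and returns
    // the XHTML that the content handler collected.
    static String toXhtml(InputStream in) throws Exception {
        ToXMLContentHandler handler = new ToXMLContentHandler();
        new AutoDetectParser().parse(in, handler, new Metadata(), new ParseContext());
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // args[0] is assumed to be the path to the document to convert.
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.println(toXhtml(in));
        }
    }
}
```

For Word documents the XHTML should mark bold and italic runs with inline elements (b and i, as far as I recall), so check the actual output for your Tika version before relying on specific tag names.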

Gagravarr answered Oct 04 '22 21:10


Instead of using WordExtractor, you can read with Range:

...
HWPFDocument doc = new HWPFDocument(fis);
Range r = doc.getRange();
...

Range is the central class of that model. Once you have a Range, you can work with the properties of the text: for instance, iterate through all CharacterRuns and check whether each one is italic (.isItalic()) or make it italic (.setItalic(true)).

for (int i = 0; i < r.numCharacterRuns(); i++) {
    CharacterRun cr = r.getCharacterRun(i);
    // Reading is done with cr.isItalic(); here every run is set to italic.
    cr.setItalic(true);
    ...
}

...
File fon = new File(yourFilePathOut);
FileOutputStream fos = new FileOutputStream(fon);
doc.write(fos); 
...

This works if you are sticking with HWPF. By the way, it is often more convenient to structure the code around Paragraphs and work with the runs inside each one.
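
Since the question is about reading the formatting rather than changing it, here is a minimal sketch that walks each Paragraph, then each CharacterRun inside it, and reports bold/italic through isBold() and isItalic(). The class FormattedRunReader and its markup and read methods are hypothetical names for this example, and the <b>/<i> markers are just one way to show the result.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.CharacterRun;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.usermodel.Range;

// Hypothetical helper class for this sketch; not part of Apache POI.
class FormattedRunReader {

    // Wraps text in <i>/<b> markers according to the given flags.
    static String markup(String text, boolean bold, boolean italic) {
        String out = text;
        if (italic) out = "<i>" + out + "</i>";
        if (bold) out = "<b>" + out + "</b>";
        return out;
    }

    // Reads a .doc stream and returns one marked-up line per paragraph.
    static String read(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        HWPFDocument doc = new HWPFDocument(in);
        Range range = doc.getRange();
        for (int p = 0; p < range.numParagraphs(); p++) {
            Paragraph para = range.getParagraph(p);
            for (int c = 0; c < para.numCharacterRuns(); c++) {
                CharacterRun run = para.getCharacterRun(c);
                // Drop the paragraph mark (\r) that HWPF keeps in the run text.
                String text = run.text().replace("\r", "");
                if (!text.isEmpty()) {
                    sb.append(markup(text, run.isBold(), run.isItalic()));
                }
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // args[0] is assumed to be the path to a .doc (not .docx) file.
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.print(read(in));
        }
    }
}
```

Other control characters, such as table cell marks, may still appear in the run text; this sketch does not filter them.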

Darius Miliauskas answered Oct 04 '22 21:10