How to read a group of shapes as an image from a Word document (.doc or .docx) using Apache POI?

I have a simple requirement: extract all the images and diagrams drawn in an MS Word file. I am able to extract images, but not groups of shapes (such as a use case diagram or an activity diagram). I want to save every diagram as an image.

I am using Apache POI.

This is the code I have written:

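import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.List;

import javax.imageio.ImageIO;

import org.apache.poi.openxml4j.opc.PackagePart;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFPictureData;

public class worddocreader {
public static void main(String args[]) {
    try (FileInputStream fs = new FileInputStream("F:/1.docx")) {
        XWPFDocument docx = new XWPFDocument(fs);

        // Pictures embedded in the document
        List<XWPFPictureData> piclist = docx.getAllPictures();
        int i = 0;
        for (XWPFPictureData pic : piclist) {
            byte[] bytepic = pic.getData();
            BufferedImage imag = ImageIO.read(new ByteArrayInputStream(bytepic));
            // ImageIO.read returns null when no registered reader can decode the data
            if (imag != null) {
                // ImageIO.write expects a format name such as "jpg", not a MIME type
                ImageIO.write(imag, "jpg", new File("F:/docParsing/imagefromword" + i + ".jpg"));
            }
            i++;
        }

        // Try to decode every part of the package as an image
        List<PackagePart> packArrayList = docx.getPackage().getParts();
        int size = packArrayList.size();
        System.out.println("Array List Size : " + packArrayList.size());

        while (size-- > 0) {
            PackagePart packagePart = packArrayList.get(size);

            System.out.println(packagePart.getContentType());

            try (InputStream in = packagePart.getInputStream()) {
                BufferedImage bfrImage = ImageIO.read(in);
                // Skip parts that are not decodable images (XML, styles, ...)
                if (bfrImage != null) {
                    ImageIO.write(bfrImage, "png", new File("F:/docParsing_emb/size" + size + ".png"));
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        System.out.println("Done");
    } catch (Exception e) {
        e.printStackTrace();
    }
}

}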

It only extracts images, not shapes.

Does anybody know how I can do this?

Karsan asked Jun 30 '14 17:06


1 Answer

So you are after the stuff defined in [MS-ODRAW], i.e. so-called OfficeDrawings which can be created directly in Word using its Drawing palette?

Unfortunately, POI offers little help here. With HWPF (POI's component for the old binary *.doc file format) you can get a handle to such data like so:

HWPFDocument document;
OfficeDrawings officeDrawings = document.getOfficeDrawingsMain();
OfficeDrawing drawing = officeDrawings.getOfficeDrawingAt(OFFSET);
// OFFSET is a global character offset describing the position of the drawing in question
// i.e. document.getRange().getStartOffset() + x

This drawing can then be further processed into individual records:

EscherRecordManager escherRecordManager = new EscherRecordManager(drawing.getOfficeArtSpContainer());
EscherSpRecord escherSpRecord = escherRecordManager.getSpRecord();
EscherOptRecord escherOptRecord = escherRecordManager.getOptRecord();

Using the data from all these records you can theoretically render out the original drawing again. But it's rather painful...

So far I've only done this in a single case where I had lots of simple arrows floating around on a page. Those had to be converted to a textual representation (something like: "Positions (x1, y1) and (x2, y2) are connected by an arrow"). Doing this essentially meant implementing the subset of [MS-ODRAW] relevant to those arrows, using the above-mentioned records. Not exactly a pleasant task.

Fallback solution using MS Word

If using MS Word itself is an option for you, then there is another, more pragmatic way:

  1. Using POI: Extract all relevant offsets that contain OfficeDrawings.
  2. Inside Word: Iterate over the document with VBA and copy all the drawings at the given offsets to the clipboard.
  3. Outside Word: Use some other application (I chose Visio) to dump the clipboard contents into a PNG.

The necessary check for a drawing in step 1 is very simple (see below). The rest can be completely automated in Word. If anyone is in need, I can share the respective VBA code.

if (characterRun.isSpecialCharacter()) {
    for (char currentChar : characterRun.text().toCharArray()) {
        if ('\u0008' == currentChar) return true;
    }
}
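
To illustrate how the check above could feed step 1, here is a minimal sketch that turns marker hits into document offsets. It is only an illustration under assumptions: the class name DrawingOffsets and the method find are hypothetical, the text and start offset are assumed to come from POI (for example characterRun.text() and characterRun.getStartOffset()), and one Java char is assumed to correspond to one character position. The sketch itself uses only the standard library so it can be run without POI.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of POI.
class DrawingOffsets {
    // Marker character tested for in the check above.
    static final char DRAWING_MARKER = '\u0008';

    // Returns the document-wide offsets of all drawing markers in runText.
    // runStartOffset is the offset of the first character of runText
    // (assumed to be taken from the character run it came from).
    static List<Integer> find(String runText, int runStartOffset) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i < runText.length(); i++) {
            if (runText.charAt(i) == DRAWING_MARKER) {
                offsets.add(runStartOffset + i);
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Two markers in a run that starts at offset 100
        System.out.println(find("Figure: \u0008 and \u0008", 100)); // prints [108, 114]
    }
}
```

Each offset collected this way could then be handed to getOfficeDrawingAt(...) from the first snippet, or exported for the VBA step.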

morido answered Nov 06 '22 12:11