I have a simple requirement: extract all the images and diagrams drawn in an MS Word file. I am able to extract only images, but not groups of shapes (like a use case diagram or an activity diagram). I want to save all the diagrams as images.
I have used Apache POI.
This is the code I have written:
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.List;
import javax.imageio.ImageIO;
import org.apache.poi.openxml4j.opc.PackagePart;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFPictureData;

public class worddocreader {
    public static void main(String[] args) {
        try (FileInputStream fs = new FileInputStream("F:/1.docx")) {
            XWPFDocument docx = new XWPFDocument(fs);
            // Pictures that POI exposes directly
            List<XWPFPictureData> piclist = docx.getAllPictures();
            int i = 0;
            for (XWPFPictureData pic : piclist) {
                BufferedImage imag = ImageIO.read(new ByteArrayInputStream(pic.getData()));
                if (imag != null) { // null when ImageIO cannot decode the data
                    // ImageIO.write expects a format name such as "jpg", not a MIME type
                    ImageIO.write(imag, "jpg", new File("F:/docParsing/imagefromword" + i + ".jpg"));
                }
                i++;
            }
            // Try to decode every part of the package as an image
            List<PackagePart> packArrayList = docx.getPackage().getParts();
            System.out.println("Array List Size : " + packArrayList.size());
            for (int size = packArrayList.size() - 1; size >= 0; size--) {
                PackagePart packagePart = packArrayList.get(size);
                System.out.println(packagePart.getContentType());
                try (InputStream in = packagePart.getInputStream()) {
                    BufferedImage bfrImage = ImageIO.read(in);
                    if (bfrImage != null) { // skip parts that are not images
                        ImageIO.write(bfrImage, "png", new File("F:/docParsing_emb/size" + size + ".png"));
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
            System.out.println("Done");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
It only extracts images, not shapes.
Does anybody know how I can do this?
So you are after the stuff defined in [MS-ODRAW], i.e. so-called OfficeDrawings which can be created directly in Word using its Drawing palette?
Unfortunately, POI offers only little help here. With HWPF (the old binary *.doc file format) you can get a handle to such data like so:
HWPFDocument document; // assumed to be loaded already
OfficeDrawings officeDrawings = document.getOfficeDrawingsMain();
OfficeDrawing drawing = officeDrawings.getOfficeDrawingAt(OFFSET);
// OFFSET is the global character offset describing the position of the drawing in question,
// i.e. document.getRange().getStartOffset() + x
This drawing can then be further processed into individual records:
EscherRecordManager escherRecordManager = new EscherRecordManager(drawing.getOfficeArtSpContainer());
EscherSpRecord escherSpRecord = escherRecordManager.getSpRecord();
EscherOptRecord escherOptRecord = escherRecordManager.getOptRecord();
Using the data from all these records you can theoretically render out the original drawing again. But it's rather painful...
So far I've only done this in a single case where I had lots of simple arrows floating around on a page. Those had to be converted to a textual representation (something like: "Positions (x1, y1) and (x2, y2) are connected by an arrow"). Doing this essentially meant implementing the subset of [MS-ODRAW] relevant to those arrows, using the above-mentioned records. Not exactly a pleasant task.
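To make the target of that conversion concrete, here is a minimal sketch of the final formatting step only. It assumes the two end points of an arrow have already been worked out from the Escher records, which is the hard part and is not shown. The class `ArrowDescription` and its `describe` method are hypothetical names chosen for illustration; they are not part of POI.

```java
import java.util.Locale;

final class ArrowDescription {
    // Hypothetical helper: turns the two end points of one arrow into the
    // textual form quoted above. The coordinates are assumed to have been
    // derived from the drawing's records beforehand.
    static String describe(int x1, int y1, int x2, int y2) {
        return String.format(Locale.ROOT,
                "Positions (%d, %d) and (%d, %d) are connected by an arrow",
                x1, y1, x2, y2);
    }

    public static void main(String[] args) {
        // Example values only, not taken from a real document
        System.out.println(describe(10, 20, 110, 20));
    }
}
```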
If using MS Word itself is an option for you, then there is another, more pragmatic way:
The necessary check for a drawing in step 1 is very simple (see below). The rest can be completely automated in Word. If anyone is in need, I can share the respective VBA code.
if (characterRun.isSpecialCharacter()) {
    for (char currentChar : characterRun.text().toCharArray()) {
        if ('\u0008' == currentChar) return true;
    }
}
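The same check can be extended to collect positions instead of returning a boolean. The following is a minimal sketch using only the JDK: it scans a piece of text for the '\u0008' placeholder and reports each hit as a base offset plus the index in the text, mirroring the "start offset + x" form of OFFSET mentioned earlier. The class `DrawingOffsets` and the `baseOffset` parameter are hypothetical, and the sketch assumes one Java char corresponds to one character position in the document.

```java
import java.util.ArrayList;
import java.util.List;

final class DrawingOffsets {
    // Character used in the check above to mark a drawing in the text
    static final char DRAWING_PLACEHOLDER = 8;

    // Returns baseOffset + index for every placeholder found in text.
    // baseOffset is assumed to be the start offset of the text in the document.
    static List<Integer> find(String text, int baseOffset) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == DRAWING_PLACEHOLDER) {
                offsets.add(baseOffset + i);
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Made-up run text containing two placeholders
        String runText = "ab" + DRAWING_PLACEHOLDER + "cd" + DRAWING_PLACEHOLDER;
        System.out.println(find(runText, 0));
    }
}
```

Each returned offset could then be passed to getOfficeDrawingAt, provided the base offset really is the global character offset of the text being scanned.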