I'm trying to get the index of an image in a .docx file using the python-docx library. I can extract the image's name, height, and width, but not its position in the Word file.
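import docx

doc = docx.Document(filename)  # filename: path to the .docx file
for s in doc.inline_shapes:
    # height and width in cm, then the picture name stored in the XML
    print(s.height.cm, s.width.cm, s._inline.graphic.graphicData.pic.nvPicPr.cNvPr.name)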
Output:
21.228 15.920 IMG_20160910_220903848.jpg
I would also like to know whether there is a simpler way to get the image name, the way s.height.cm gives me the height in cm. My primary requirement is to find out where the image is in the document, because I need to extract the image, do some work on it, and then put it back in the same location.
The python-docx module lets you manipulate Word documents, either by modifying an existing one or by creating a new empty document and building it up. It is a powerful tool that lets you manipulate a document to a very large extent. To add an image to a Word document, use the add_picture() method.
Word documents can be manipulated in Python using the python-docx module. 1. The first step is to install this third-party module, python-docx, with pip: "pip install python-docx". 2. After installation, import "docx", NOT "python-docx".
This operation is not directly supported by the API. However, if you're willing to dig into the internals a bit and use the underlying lxml API, it's possible.
The general approach would be to access the ImagePart instance corresponding to the picture you want to inspect and modify, then read and write the ._blob attribute (which holds the image file as bytes).
This specimen XML might be helpful: http://python-docx.readthedocs.io/en/latest/dev/analysis/features/shapes/picture.html#specimen-xml
From the inline shape containing the picture, you get the <a:blip> element with this:
blip = inline_shape._inline.graphic.graphicData.pic.blipFill.blip
The relationship id (r:id generally, but r:embed in this case) is available at:
rId = blip.embed
Then you can get the image part from the document part:
document_part = document.part
image_part = document_part.related_parts[rId]
And then the binary image is available for read and write on ._blob.
If you write a new blob, it will replace the prior image when saved.
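Putting those steps together, a minimal sketch might look like the following. The helper names get_image_part and replace_image_bytes, and the transform callable, are made up for illustration and are not part of python-docx. The attribute chain and ._blob are private internals taken from the steps above, so they may change between versions.

```python
def get_image_part(document, inline_shape):
    """Return the image part behind an inline picture (relies on private internals)."""
    # <a:blip> element of the picture, as described above
    blip = inline_shape._inline.graphic.graphicData.pic.blipFill.blip
    # relationship id stored in the r:embed attribute
    rId = blip.embed
    return document.part.related_parts[rId]


def replace_image_bytes(document, inline_shape, transform):
    """Run the original image bytes through transform and store the result.

    transform is a hypothetical callable taking bytes and returning bytes.
    Returns the original bytes so the caller can keep a copy.
    """
    image_part = get_image_part(document, inline_shape)
    original = image_part._blob
    image_part._blob = transform(original)
    return original


# Usage sketch (file names are placeholders):
#   import docx
#   document = docx.Document("input.docx")
#   shape = document.inline_shapes[0]
#   replace_image_bytes(document, shape, lambda data: data)  # identity: no change
#   document.save("output.docx")
```

Because the picture element itself is never moved, the image stays at the same location in the document. This sketch does not update the part's file name or content type, so it is probably safest to write back bytes in the same image format as the original.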
You probably want to get it working with a single image and get a feel for it before scaling up to multiple images in a single document.
There might be one or two image characteristics that are cached, so you might not get all the finer points working until you save and reload the file, so just be alert for that.
Not for the faint of heart as you can see, but should work if you want it bad enough and can trace through the code a bit :)
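As for where a picture sits in the document, one possible approach, not covered above and offered only as a sketch, is to walk the body paragraphs and record which ones contain an <a:blip> element. The function name picture_locations is hypothetical. It assumes the private ._p element and its xpath() method accept the a: and r: namespace prefixes, and it only sees the top-level paragraphs returned by document.paragraphs, not pictures inside tables, headers, or footers.

```python
def picture_locations(document):
    """Return (paragraph_index, rId) pairs for pictures found in body paragraphs.

    paragraph_index is the position in document.paragraphs; rId is the
    value of the r:embed attribute on the picture's <a:blip> element.
    """
    locations = []
    for index, paragraph in enumerate(document.paragraphs):
        # ._p is the underlying <w:p> element (private API)
        for rId in paragraph._p.xpath(".//a:blip/@r:embed"):
            locations.append((index, rId))
    return locations


# Usage sketch (file name is a placeholder):
#   import docx
#   document = docx.Document("input.docx")
#   for index, rId in picture_locations(document):
#       image_part = document.part.related_parts[rId]
#       print(index, rId, len(image_part._blob))
```

Each rId returned here can be looked up in document.part.related_parts, the same way as in the steps above, to reach the image bytes for that paragraph.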