Extract image position from .docx file using python-docx

Tags:

python

python-docx

I'm trying to get the image index from a .docx file using the python-docx library. I can extract the image name, height, and width, but not the index of where the image sits in the Word file.

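import docx

doc = docx.Document(filename)  # filename: path to the .docx file
for s in doc.inline_shapes:
    # height and width in cm, then the picture name stored in the XML
    print(s.height.cm, s.width.cm, s._inline.graphic.graphicData.pic.nvPicPr.cNvPr.name)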

Output:

21.228  15.920 IMG_20160910_220903848.jpg

I would also like to know whether there is a simpler way to get the image name, the way s.height.cm gave me the height in cm. My primary requirement is to find out where the image is in the document, because I need to extract the image, do some work on it, and then put it back in the same location.


asked Dec 17 '16 15:12

argo

1 Answer

This operation is not directly supported by the API.

However, if you're willing to dig into the internals a bit and use the underlying lxml API it's possible.

The general approach would be to access the ImagePart instance corresponding to the picture you want to inspect and modify, then read and write the ._blob attribute (which holds the image file as bytes).

This specimen XML might be helpful: http://python-docx.readthedocs.io/en/latest/dev/analysis/features/shapes/picture.html#specimen-xml

From the inline shape containing the picture, you get the <a:blip> element with this:

blip = inline_shape._inline.graphic.graphicData.pic.blipFill.blip

The relationship id (r:id generally, but r:embed in this case) is available at:

rId = blip.embed

Then you can get the image part from the document part:

document_part = document.part
image_part = document_part.related_parts[rId]

The binary image is then available for reading and writing on ._blob.

If you write a new blob, it will replace the prior image when saved.
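
Putting those steps together, here is a minimal sketch rather than a definitive implementation. The helper names (get_image_part, replace_image_bytes, process_first_image) and the transform callback are made up for illustration; only the attribute chain, .embed, .related_parts and ._blob come from the steps above. It assumes the shape is an inline picture and that transform returns bytes in the same image format as the original.

```python
def get_image_part(inline_shape, document_part):
    """Return the image part behind an inline picture."""
    # <a:blip> carries the relationship id in its r:embed attribute.
    blip = inline_shape._inline.graphic.graphicData.pic.blipFill.blip
    rId = blip.embed
    return document_part.related_parts[rId]


def replace_image_bytes(inline_shape, document_part, transform):
    """Pass the picture's bytes through `transform` and store the result.

    Returns the original bytes so the caller can keep a backup.
    """
    image_part = get_image_part(inline_shape, document_part)
    original = image_part._blob
    image_part._blob = transform(original)
    return original


def process_first_image(src_path, dst_path, transform):
    """Hypothetical end-to-end use on the first inline picture of a file."""
    import docx  # python-docx; imported here so the helpers above stay standalone

    document = docx.Document(src_path)
    shape = document.inline_shapes[0]
    replace_image_bytes(shape, document.part, transform)
    document.save(dst_path)
```

Because only the bytes of the existing image part are swapped, the picture keeps its place in the document; the displayed size stored on the shape is not touched, so a transform that changes the aspect ratio would probably need a separate size adjustment.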

You probably want to get it working with a single image and get a feel for it before scaling up to multiple images in a single document.

One or two image characteristics might be cached, so some of the finer points may not work until you save and reload the file. Just be alert for that.

Not for the faint of heart, as you can see, but it should work if you want it badly enough and can trace through the code a bit :)
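
As for where each picture sits, one rough way to get an index is to walk the paragraphs and look for <a:blip> elements inside each one. This is only a sketch under assumptions: locate_images is a made-up helper, it relies on the private ._p element of each paragraph, and if you pass it document.paragraphs it only sees body-level paragraphs (not table cells, headers or footers). Linked pictures that have no r:embed attribute are skipped.

```python
# Namespace URIs used by DrawingML and by relationship attributes in .docx XML.
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
BLIP_TAG = "{%s}blip" % A_NS
EMBED_ATTR = "{%s}embed" % R_NS


def locate_images(paragraphs):
    """Return (paragraph_index, rId) pairs for embedded pictures, in order.

    `paragraphs` is any sequence of objects exposing the paragraph XML
    element as `._p`, for example `document.paragraphs` in python-docx.
    """
    positions = []
    for index, paragraph in enumerate(paragraphs):
        # iter() walks all descendants, so a blip nested in a run is found.
        for blip in paragraph._p.iter(BLIP_TAG):
            rId = blip.get(EMBED_ATTR)
            if rId is not None:
                positions.append((index, rId))
    return positions
```

Something like locate_images(document.paragraphs) would then give the paragraph index for each relationship id, and the rId can be matched against blip.embed from the steps above.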


answered Sep 27 '22 17:09

scanny
