 

Using Python to extract images and text from a Word document

I would like to run a script on a folder full of Word documents that reads through each document and pulls out the images and their captions (the text right below each image). From the research I've done, I think pywin32 might be a viable solution. I know how to use pywin32 to find strings and pull them out, but I need help with the images part. How can I read through a .docx file and have an event occur when an image is found? Thank you for any help! I am using Python 2.7.

asked Jun 14 '11 14:06 by Preston Donovan




1 Answer

A .docx file is a ZIP archive, so it can be unzipped to extract the images, which are stored in the word/media/ folder inside the archive.
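As a rough illustration of the unzip approach, here is a minimal sketch that uses only the standard library. The function names `extract_images` and `images_with_captions` are made up for this example. The caption lookup rests on an assumption taken from the question, namely that the caption is the paragraph immediately after the one holding the image. It only handles DrawingML pictures (`a:blip` elements), so older VML images and images inside text boxes may be missed or mismatched.

```python
import os
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used inside a .docx package
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
A = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
R = "{http://schemas.openxmlformats.org/officeDocument/2006/relationships}"
REL = "{http://schemas.openxmlformats.org/package/2006/relationships}"


def extract_images(docx_path, out_dir):
    """Copy every file under word/media/ in the archive into out_dir."""
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    written = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            if name.startswith("word/media/") and not name.endswith("/"):
                target = os.path.join(out_dir, os.path.basename(name))
                with open(target, "wb") as handle:
                    handle.write(archive.read(name))
                written.append(target)
    return written


def images_with_captions(docx_path):
    """Return (image file name, caption) pairs.

    Assumption: the caption is the text of the paragraph that directly
    follows the paragraph containing the image.
    """
    with zipfile.ZipFile(docx_path) as archive:
        rels = ET.fromstring(archive.read("word/_rels/document.xml.rels"))
        body = ET.fromstring(archive.read("word/document.xml"))

    # Map relationship ids (e.g. "rId5") to targets (e.g. "media/image1.png")
    targets = {}
    for rel in rels.iter(REL + "Relationship"):
        targets[rel.get("Id")] = rel.get("Target")

    paragraphs = list(body.iter(W + "p"))
    pairs = []
    for index, paragraph in enumerate(paragraphs):
        for blip in paragraph.iter(A + "blip"):
            target = targets.get(blip.get(R + "embed"))
            if target is None:
                continue  # linked or unresolved image, skip it
            caption = ""
            if index + 1 < len(paragraphs):
                following = paragraphs[index + 1]
                caption = "".join(t.text or "" for t in following.iter(W + "t"))
            pairs.append((os.path.basename(target), caption))
    return pairs
```

To process a whole folder, one option is to loop over `glob.glob(os.path.join(folder, "*.docx"))` and call both functions on each path, writing each document's images into its own output directory so that names like image1.png do not collide. This works only for .docx files; the older binary .doc format is not a ZIP archive.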

answered Oct 12 '22 16:10 by Kevin C.