
Parse .docx in python 3

I am currently writing a Python 3 program that parses certain .docx files and extracts the text and images from them. I have been trying to use docx, but it will not import into my program. I have installed lxml, Pillow, and python-docx, yet it still does not import. When I try to use python-docx from the terminal, I cannot run example-extracttext.py or example-makedocument.py, which leads me to believe that the installation didn't run properly. Is there a way I can check whether it installed correctly, or a way to get it working so I can import it into my project? I am on Ubuntu 13.10.

thehoule64 asked Feb 10 '14 01:02



2 Answers

I recommend you try the latest version of python-docx which is installed like this:

$ pip install python-docx

Documentation is available here: http://python-docx.readthedocs.org/

Installation should result in a message that looks successful. It's possible you'll need to install using sudo to temporarily assume root privileges:

$ sudo pip install python-docx

After installation you should be able to do the following in the Python interpreter:

>>> from docx import Document
>>>

If instead you get something like this, the install didn't go properly:

>>> from docx import Document
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named docx
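
The same check can be scripted without triggering the ImportError, by asking the interpreter whether it can locate the module at all. This is only a minimal sketch using the standard library; the helper name `is_importable` is hypothetical, and it reports on whichever interpreter runs the script, so run it with the same `python3` you use for your project.

```python
import importlib.util
import sys


def is_importable(module_name):
    """Return True if the running interpreter can locate a top-level module."""
    # find_spec returns None when the module cannot be found on sys.path.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Show which interpreter is answering, since pip and python3 may disagree.
    print("Interpreter:", sys.executable)
    print("docx importable:", is_importable("docx"))
```

If this prints False under python3 but True under python, the package was most likely installed for the other interpreter.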

If you can provide more feedback on your attempts, I can elaborate on the answer.

Note that after v0.2.x the python-docx package was rewritten. The API of v0.3.x+ is different, as are the package name and repository location. All further development will be on the new version. If you're just starting out with the package, going with the latest version is probably a good idea, as the old one will only receive legacy support going forward.

Also, Python 3 support was added in v0.3.0. Earlier versions are not Python 3 compatible.
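
To see which version is actually installed, and whether it is at or past the v0.3.0 cutoff mentioned above, something like the following may help. It is a sketch under a few assumptions: it needs Python 3.8+ for `importlib.metadata`, the helper names `installed_version` and `supports_python3` are hypothetical, and the version parsing only handles plain numeric components such as `0.3.0`.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string for a distribution, or None."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def supports_python3(version_string):
    """Return True if the version is v0.3 or later (assumes numeric parts)."""
    parts = version_string.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return (major, minor) >= (0, 3)


if __name__ == "__main__":
    version = installed_version("python-docx")
    if version is None:
        print("python-docx is not installed for this interpreter")
    else:
        print("python-docx", version, "Python 3 capable:", supports_python3(version))
```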

scanny answered Oct 08 '22 13:10


You can solve your import problem by first uninstalling the existing installation and then installing it again with pip3. This solved my problem:

    pip uninstall python-docx
    pip3 install python-docx
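
The reason pip3 helps is that plain pip may belong to a different interpreter than the one running your program. A related approach, sketched here as an assumption rather than something I have tested on Ubuntu 13.10, is to invoke pip through the interpreter itself with `-m pip`. The helper `pip_install_command` is hypothetical and only builds the command; it does not run it.

```python
import sys


def pip_install_command(package):
    """Build a pip install command tied to the interpreter running this script."""
    # Using sys.executable means the package lands where this interpreter looks.
    return [sys.executable, "-m", "pip", "install", package]


if __name__ == "__main__":
    # Print the command so it can be inspected or pasted into a terminal.
    print(" ".join(pip_install_command("python-docx")))
```

The printed command can be run by hand, or passed to `subprocess.check_call` if you prefer to run it from Python.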

Farooq Zaman answered Oct 08 '22 12:10