

Read .doc file with python

Tags:

python

python-2.7

I got a test for a job application, and my task is to read some .doc files. Does anyone know a library that can do this? I started with plain Python code:

    f = open('test.doc', 'r')
    f.read()

but this does not return a readable string. I need to convert it to UTF-8.

Edit: I just want to get the text from this file.


asked Mar 15 '16 02:03

Italo Lemos

2 Answers

You can use the textract library. It handles both .doc and .docx files.

    import textract
    text = textract.process("path/to/file.extension")

You can also use antiword directly (sudo apt-get install antiword). It converts a .doc file to plain text, so no further docx2txt step is needed:

    antiword filename.doc > filename.txt

Under the hood, textract itself relies on antiword to handle .doc files.
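If you would rather call antiword from Python than from the shell, a minimal sketch could look like the one below. It assumes Python 3.7+ (for capture_output), that antiword is installed and on PATH, and that its UTF-8.txt character mapping is available to the -m option. The helper names antiword_command and read_doc are made up for this illustration.

```python
import shutil
import subprocess


def antiword_command(doc_path):
    """Build the antiword command line; -m UTF-8.txt requests UTF-8 output."""
    return ["antiword", "-m", "UTF-8.txt", doc_path]


def read_doc(doc_path):
    """Return the text of a .doc file as a str by running the antiword CLI."""
    if shutil.which("antiword") is None:
        raise RuntimeError("antiword is not installed or not on PATH")
    result = subprocess.run(
        antiword_command(doc_path),
        capture_output=True,  # collect stdout/stderr instead of printing them
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    # antiword writes bytes; decode them so the caller gets a normal string.
    return result.stdout.decode("utf-8", errors="replace")
```

Calling read_doc("test.doc") should then give you a regular string. On Python 2.7 you would need subprocess.check_output instead of subprocess.run.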

answered Sep 30 '22 02:09

Shivam Kotwalia

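You can use the python-docx2txt library to read text from Microsoft Word documents. It improves on the python-docx library in that it can also extract text from links, headers, and footers. It can even extract images. You can install it by running: pip install docx2txt.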

Let's read a sample Microsoft Word document, saved here as test.docx:

    import docx2txt
    my_text = docx2txt.process("test.docx")
    print(my_text)

[Screenshot: terminal output of the code above]
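For a rough idea of what reading a .docx involves, here is a standard-library sketch. It relies on the fact that a .docx file is a ZIP archive whose main text lives in word/document.xml. This is not how docx2txt is implemented, and it only covers body paragraphs (no headers, footers, or images). The function name docx_paragraphs is invented for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path_or_file):
    """Return the body paragraphs of a .docx file as a list of strings."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> elements.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        paragraphs.append(text)
    return paragraphs
```

Something like print("\n".join(docx_paragraphs("test.docx"))) would then print the document text, one paragraph per line.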

EDIT:

This does NOT work for .doc files. I am only keeping this answer because some people seem to find it useful for .docx files.
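Since .doc and .docx need different tools, it can help to check which one you actually have before picking a library. The sketch below looks at the first bytes of the file: legacy .doc files use the OLE2 compound-file signature, while .docx files are ZIP archives. The function name sniff_word_format is hypothetical, and both signatures are shared with other formats (for example .xls, or any other ZIP file), so treat the result as a hint only.

```python
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # legacy Office container (.doc, .xls, ...)
ZIP_MAGIC = b"PK\x03\x04"                          # ZIP archive, used by .docx


def sniff_word_format(path):
    """Guess 'doc', 'docx', or 'unknown' from the file's leading bytes."""
    # Open in binary mode: a .doc file is not text, so reading it with
    # mode 'r' yields unreadable output or a decoding error.
    with open(path, "rb") as handle:
        header = handle.read(8)
    if header.startswith(OLE2_MAGIC):
        return "doc"
    if header.startswith(ZIP_MAGIC):
        return "docx"
    return "unknown"
```

If this returns "doc", use a tool that understands the legacy format (such as textract or antiword). If it returns "docx", docx2txt is an option.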

answered Sep 30 '22 02:09

Billal Begueradj
