Read a .doc file with Python

I got a test for a job application, and my task is to read some .doc files. Does anyone know a library that can do this? I started with plain Python code:

f = open('test.doc', 'r')
f.read()

but this does not return a readable string. I need to convert it to UTF-8.

Edit: I just want to get the text from this file.

Italo Lemos asked Mar 15 '16 02:03


2 Answers

You can use the textract library. It handles both "doc" and "docx" files.

import textract
text = textract.process("path/to/file.extension")
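Since the question asks for a UTF-8 string, note that textract.process returns bytes, so the result usually needs decoding. A minimal sketch, assuming textract is installed; the helper names decode_extracted and read_word_text, and the extractor parameter, are made up for illustration and are not part of textract:

```python
def decode_extracted(raw, encoding="utf-8"):
    """Turn the bytes returned by an extractor into a str."""
    # errors="replace" keeps going if a byte sequence is not valid UTF-8.
    return raw.decode(encoding, errors="replace")


def read_word_text(path, extractor=None):
    """Return the text of a Word file as a str.

    `extractor` is a hypothetical hook: any callable that takes a path
    and returns bytes. It defaults to textract.process.
    """
    if extractor is None:
        import textract  # third-party: pip install textract
        extractor = textract.process
    return decode_extracted(extractor(path))
```

Calling read_word_text("path/to/file.doc") should then give a normal Python string instead of a bytes object.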

You can also use antiword directly (sudo apt-get install antiword). It converts a .doc file to plain text, not to .docx, so redirect its output to a text file and read that:

antiword filename.doc > filename.txt

Under the hood, textract uses antiword for .doc files.
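If you would rather skip textract and call antiword yourself from Python, something like the following might work. This is only a sketch: it assumes antiword is on your PATH and that its output can be decoded as UTF-8 (that depends on your locale and antiword's character mapping). The function names antiword_command and doc_to_text are my own, not from any library.

```python
import shutil
import subprocess


def antiword_command(path):
    """Build the argument list for running antiword on one file."""
    return ["antiword", path]


def doc_to_text(path):
    """Run antiword on a .doc file and return its plain-text output."""
    if shutil.which("antiword") is None:
        raise RuntimeError("antiword is not installed or not on PATH")
    # antiword writes the extracted text to standard output.
    result = subprocess.run(
        antiword_command(path), capture_output=True, check=True
    )
    # Assumption: the output is UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

Passing the arguments as a list avoids going through a shell, so file names with spaces do not need quoting.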

Shivam Kotwalia answered Sep 30 '22 02:09


You can use the python-docx2txt library to read text from Microsoft Word documents. It improves on the python-docx library in that it can also extract text from links, headers and footers. It can even extract images.

You can install it by running: pip install docx2txt.

Let's read a sample Word document, saved here as test.docx:

import docx2txt
my_text = docx2txt.process("test.docx")
print(my_text)


EDIT:

This does NOT work for .doc files. The only reason I am keeping this answer is that some people seem to find it useful for .docx files.
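Because the two formats need different tools, it can help to check which one you actually have before picking a library. File extensions are not always reliable, so here is a rough sketch that looks at the first bytes of the file instead. It relies on two well-known signatures: legacy .doc files are OLE2 compound files, and .docx files are ZIP archives. The function name sniff_word_format is hypothetical, and a ZIP signature only tells you the file is some ZIP-based format, not that it is definitely a Word document.

```python
# Signature of an OLE2 compound file (used by legacy .doc).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# Signature of a ZIP local file header (used by .docx and other ZIP formats).
ZIP_MAGIC = b"PK\x03\x04"


def sniff_word_format(path):
    """Guess 'doc', 'docx', or 'unknown' from the file's leading bytes."""
    with open(path, "rb") as f:  # binary mode: these are not text files
        header = f.read(8)
    if header.startswith(OLE2_MAGIC):
        return "doc"
    if header.startswith(ZIP_MAGIC):
        return "docx"
    return "unknown"
```

You could then send "docx" results to docx2txt and "doc" results to a tool that understands the old binary format.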

Billal Begueradj answered Sep 30 '22 02:09