I have a test for a job application, and my task is to read some .doc files. Does anyone know a library to do this? I started with plain Python code:
f = open('test.doc', 'r')
f.read()
but this does not return a readable string. I need to convert it to UTF-8.
Edit: I just want to get the text from this file.
With python-docx, your Python programs can read the text from a .docx file and use it just like any other string value.
How to open a DOC file. You can open DOC files with Microsoft Word in Windows and macOS. Word is the best application for opening DOC files because it fully supports the formatting of Word documents, which includes text spacing and alignment, images, charts, and tables.
To extract text from MS Word (.docx) files in Python, we can use the zipfile library. We create a ZipFile object with the path string to the Word file. Then we call read with 'word/document.xml' to read the XML that holds the document's text.
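The zipfile approach can be sketched roughly as below. This is a minimal sketch, not a complete converter: the function name docx_to_text is made up for illustration, it only works for .docx (a ZIP container), and it only collects paragraph text from the main document part, so headers, footers, and footnotes are ignored.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the paragraph text of a .docx file, one paragraph per line.

    Hypothetical helper for illustration; it is not part of any library.
    """
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Each <w:t> element holds one run of visible text.
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

Calling docx_to_text("test.docx") would return the text as a normal str. A legacy .doc file is not a ZIP archive, so zipfile raises BadZipFile for it.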
One can use the textract library. It takes care of both "doc" and "docx" files.
import textract
# process() returns bytes, so decode to get a regular string
text = textract.process("path/to/file.extension")
text = text.decode("utf-8")
You can also use 'antiword' (sudo apt-get install antiword) to convert the .doc file to plain text. Note that antiword writes plain text, not a .docx file, so the output can be read directly and does not need docx2txt.
antiword filename.doc > filename.txt
Under the hood, textract uses antiword for .doc files.
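If you would rather call antiword from Python yourself, something like the following might work. It is only a sketch: doc_to_text and its command parameter are hypothetical names, it assumes antiword is installed and on PATH, and it assumes the output is UTF-8 (the real encoding depends on how antiword is configured on your system).

```python
import subprocess


def doc_to_text(path, command=("antiword",)):
    """Run an external converter on `path` and return what it prints.

    Hypothetical helper. `command` defaults to the antiword executable;
    the parameter exists only so another tool can be swapped in.
    """
    result = subprocess.run(
        [*command, path],
        capture_output=True,  # collect stdout and stderr instead of printing
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    # Assumption: the converter emits UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

With antiword installed, doc_to_text("filename.doc") returns the document text as a str, with no intermediate file needed.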
You can use the docx2txt library to read text from Microsoft Word (.docx) documents. It is an improvement over the python-docx library because it can also extract text from links, headers, and footers. It can even extract images.
You can install it by running: pip install docx2txt
Let's download and read the first Microsoft document on here:
import docx2txt
my_text = docx2txt.process("test.docx")
print(my_text)
EDIT:
This does NOT work for .doc files. The only reason I am keeping this answer is that some people seem to find it useful for .docx files.
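Since the two formats need different tools, it can help to check which one you actually have before picking a library. Here is a rough sketch that looks at the first bytes of the file. The function name sniff_word_format is made up, and the check is only a hint: the OLE2 signature is shared by other legacy Office files (.xls, .ppt), and the ZIP signature matches any ZIP archive.

```python
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE2 compound file (legacy .doc)
ZIP_MAGIC = b"PK\x03\x04"                       # ZIP container (.docx)


def sniff_word_format(path):
    """Guess 'doc', 'docx', or 'unknown' from the file's leading bytes.

    Hypothetical helper; it identifies the container, not Word specifically.
    """
    with open(path, "rb") as handle:
        header = handle.read(8)
    if header.startswith(OLE_MAGIC):
        return "doc"
    if header.startswith(ZIP_MAGIC):
        return "docx"
    return "unknown"
```

A result of "docx" suggests docx2txt should work, while "doc" means you need something like textract or antiword instead. This also explains the unreadable output in the question: a .doc file is binary, so opening it in text mode does not give a usable string.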