
Extracting text from MS Word files in Python

For working with MS Word files in Python, there are the Python win32 extensions, which can be used on Windows. How do I do the same on Linux? Is there a library for it?

Badri asked Sep 24 '08 03:09




2 Answers

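Use the pure-Python docx module (installed as the python-docx package). It reads .docx files only, not the older binary .doc format. Here's how to extract all the text from a .docx file: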

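    import docx  # provided by the python-docx package

    # filename is the path to a .docx file
    document = docx.Document(filename)
    # Join the paragraph texts, separated by blank lines
    docText = '\n\n'.join(
        paragraph.text for paragraph in document.paragraphs
    )
    print(docText)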

See Python DocX site

Also check out Textract, which pulls out tables and other content too.

Parsing XML with regexes invokes Cthulhu. Don't do it!
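If installing a third-party package isn't an option, the same idea can be sketched with only the standard library, using a real XML parser rather than regexes. This relies on a .docx file being a zip archive whose main body lives in word/document.xml. The function name docx_to_text is made up for illustration. It visits every paragraph element, including those inside table cells, and ignores headers, footers and footnotes, so treat it as a rough sketch rather than a replacement for the docx module:

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the text of a .docx file, paragraphs separated by blank lines.

    Hypothetical helper for illustration only; not part of any library.
    """
    # A .docx file is a zip archive; the document body is one XML member.
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)

    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> elements.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        paragraphs.append(text)
    return "\n\n".join(paragraphs)
```

Calling print(docx_to_text(filename)) should give output comparable to the snippet above for simple documents.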

mikemaccana answered Sep 20 '22 23:09


You could make a subprocess call to antiword, a Linux command-line utility for dumping the text out of a Word .doc file. It works pretty well for simple documents (though it obviously loses formatting). It's available through apt, and probably as an RPM, or you could compile it yourself.
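A minimal sketch of that subprocess call, assuming antiword is installed and on the PATH. The function name doc_to_text and its executable parameter are made up for illustration; the only antiword behaviour relied on is that it takes the file path as an argument and writes the text to standard output:

```python
import subprocess


def doc_to_text(path, executable="antiword"):
    """Run antiword on a .doc file and return what it writes to stdout.

    Hypothetical helper; `executable` exists so a different binary name
    or full path can be supplied.
    """
    result = subprocess.run(
        [executable, path],
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode the output to str
        check=True,           # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout
```

If antiword is not installed, subprocess.run raises FileNotFoundError, and a file antiword cannot read surfaces as CalledProcessError, so callers may want to catch both.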

John Fouhy answered Sep 22 '22 23:09