
Best way to extract text from a Word doc without using COM/automation?

Is there a reasonable way to extract plain text from a Word file that doesn't depend on COM automation? (This is a feature for a web app deployed on a non-Windows platform, and that's non-negotiable in this case.)

Antiword seems like it might be a reasonable option, but it seems like it might be abandoned.

A Python solution would be ideal, but doesn't appear to be available.

Kevin asked Sep 03 '08 20:09


People also ask

How do I extract just the text from a Word document?

Open the DOCX file and click File > Save As > Computer > Browse. Choose to save the file as Plain Text (for XLSX files, save it as Text (Tab delimited)). Locate and open the text file under the name you saved it with. This text file will contain only the text from your original file, without any formatting.

How do I extract text from a Word document in Python?

To extract text from a .docx file in Python, we can use the zipfile library: create a ZipFile object from the path to the Word file, then call read with 'word/document.xml' to get the XML that holds the document body.
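The zipfile approach described above can be sketched roughly as follows. This is a minimal illustration, not a complete converter: the function name `docx_to_text` is made up for this example, it only handles .docx (Office Open XML) files and not the legacy binary .doc format, and it ignores headers, footers, and footnotes, which live in other parts of the archive.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the body text of a .docx file, one line per paragraph.

    `path` may be a filename or a binary file-like object.
    (Hypothetical helper name, used only for this sketch.)
    """
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    # Each w:p element is a paragraph; its visible text sits in w:t elements.
    for para in root.iter(W_NS + "p"):
        runs = [node.text for node in para.iter(W_NS + "t") if node.text]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

Calling `docx_to_text('Hello world.docx')` should give back the paragraphs joined by newlines. Parsing the XML properly, rather than stripping tags with a regex, keeps text that Word has split across several runs together.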


2 Answers

(Same answer as extracting text from MS word files in python)

Use the native Python docx module which I made this week. Here's how to extract all the text from a doc:

    # opendocx, getdocumenttext and wordnamespaces come from the docx module
    document = opendocx('Hello world.docx')

    # This location is where most document content lives
    docbody = document.xpath('/w:document/w:body', namespaces=wordnamespaces)[0]

    # Extract all text
    print(getdocumenttext(document))

See Python DocX site

100% Python, no COM, no .NET, no Java, no parsing serialized XML with regexes.

mikemaccana answered Oct 01 '22 11:10


I use catdoc or antiword for this, whichever gives the result that is easiest to parse. I have embedded this in Python functions, so it is easy to use from the parsing system (which is written in Python).

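    import subprocess

    def doc_to_text_catdoc(filename):
        # os.popen3 no longer exists in Python 3, so use subprocess instead.
        # Passing the arguments as a list avoids shell quoting of the filename.
        proc = subprocess.Popen(['catdoc', '-w', filename],
                                stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE,
                                universal_newlines=True)
        retval, erroroutput = proc.communicate()
        if not erroroutput:
            return retval
        else:
            raise OSError("Executing the command caused an error: %s" % erroroutput)

    # similar doc_to_text_antiword()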

The -w switch to catdoc turns off line wrapping, BTW.
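If you want a single entry point that uses whichever of the two tools is installed, something along these lines might work. It is only a sketch: the name `doc_to_text`, the `EXTRACTORS` list, and the order of preference are my own assumptions, and it treats a non-zero exit code (rather than any output on stderr) as failure.

```python
import shutil
import subprocess

# Candidate command lines, tried in order. The order is an assumption;
# swap it if antiword gives output that is easier to parse for your files.
EXTRACTORS = [
    ["catdoc", "-w"],  # -w turns off line wrapping
    ["antiword"],
]


def doc_to_text(filename, extractors=EXTRACTORS):
    """Return the text produced by the first extractor found on PATH."""
    for command in extractors:
        if shutil.which(command[0]) is None:
            continue  # tool not installed, try the next one
        result = subprocess.run(
            command + [filename],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
        if result.returncode != 0:
            raise OSError("%s failed: %s" % (command[0], result.stderr))
        return result.stdout
    raise OSError("no extractor found on PATH")
```

Usage would be `text = doc_to_text('report.doc')`, where `report.doc` stands in for your own file. Keeping the command lines in a list makes it easy to add another converter later without touching the function.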

codeape answered Oct 01 '22 13:10