Is there a reasonable way to extract plain text from a Word file that doesn't depend on COM automation? (This is a a feature for a web app deployed on a non-Windows platform - that's non-negotiable in this case.)
Antiword seems like it might be a reasonable option, but it seems like it might be abandoned.
A Python solution would be ideal, but doesn't appear to be available.
Open the DOCX file and click on File > Save As > Computer > Browser. Choose to save file as Plain Text (for XLSX files, save it as Text (Tab delimited)). Locate and open the text file with the name you have used to save it. This text file will contain only the text from your original file without any formatting.
To extract text from MS word files in Python, we can use the zipfile library. to create ZipFile object with the path string to the Word file. Then we call read with 'word/document. xml' to read the Word file.
(Same answer as extracting text from MS word files in python)
Use the native Python docx module which I made this week. Here's how to extract all the text from a doc:
document = opendocx('Hello world.docx') # This location is where most document content lives docbody = document.xpath('/w:document/w:body', namespaces=wordnamespaces)[0] # Extract all text print getdocumenttext(document)
See Python DocX site
100% Python, no COM, no .net, no Java, no parsing serialized XML with regexs.
I use catdoc or antiword for this, whatever gives the result that is the easiest to parse. I have embedded this in python functions, so it is easy to use from the parsing system (which is written in python).
import os def doc_to_text_catdoc(filename): (fi, fo, fe) = os.popen3('catdoc -w "%s"' % filename) fi.close() retval = fo.read() erroroutput = fe.read() fo.close() fe.close() if not erroroutput: return retval else: raise OSError("Executing the command caused an error: %s" % erroroutput) # similar doc_to_text_antiword()
The -w switch to catdoc turns off line wrapping, BTW.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With