Anyone know of anything they can recommend in order to extract just the plain text from a .doc or .docx?
I've found this - wondered if there were any other suggestions?
Open the DOCX file and click File > Save As > Computer > Browse. Choose to save the file as Plain Text (for XLSX files, save as Text (Tab delimited)). Then locate and open the text file under the name you saved it with. It will contain only the text from your original file, without any formatting.
In Microsoft Word on Windows, choose Save As from the File menu. Open the Save as type drop-down list and select Plain Text (*.txt). Click the Save button and a File Conversion window will open.
To extract the contents of the file, right-click on the file and select “Extract All” from the popup menu. On the “Select a Destination and Extract Files” dialog box, the path where the content of the .
To extract text from MS Word files in Python, we can use the zipfile library: create a ZipFile object with the path string to the Word file, then call read with 'word/document.xml' to get the document's XML.
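For what it's worth, here is a minimal sketch of that zipfile approach. The function name docx_to_text is just something I made up for illustration, and it assumes the text you want sits in the usual <w:p> paragraph and <w:t> text elements of word/document.xml. Headers, footers and footnotes live in other parts of the archive and are not read here, and this only works for .docx, not the old binary .doc format.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(docx_file):
    """Return the body text of a .docx, one line per paragraph.

    docx_file may be a path string or a binary file-like object.
    (Hypothetical helper, not part of any library.)
    """
    # A .docx is a zip archive; the main body is stored in word/document.xml
    with zipfile.ZipFile(docx_file) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Each run of text lives in a <w:t> element inside the paragraph
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

Calling print(docx_to_text("some.docx")) should then print the body text. Parsing the XML rather than stripping tags keeps paragraph breaks and decodes entities such as &amp; for you.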
If you want the pure plain text (my requirement) then all you need is:
unzip -p some.docx word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'
Which I found at commandlinefu.
It unzips the docx file, gets the actual document XML, then strips all the XML tags. Obviously all formatting is lost.
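If you'd rather not depend on unzip and sed, a rough Python counterpart of that one-liner might look like the sketch below. The name strip_docx_tags is hypothetical, and str.isprintable() is only an approximation of sed's [:print:] class, which depends on your locale.

```python
import re
import zipfile


def strip_docx_tags(docx_file):
    """Rough counterpart of the unzip | sed pipeline (hypothetical helper)."""
    # Same first step as `unzip -p some.docx word/document.xml`
    with zipfile.ZipFile(docx_file) as archive:
        xml_text = archive.read("word/document.xml").decode("utf-8")
    # Drop every XML tag, like the first sed expression
    text = re.sub(r"<[^>]+>", "", xml_text)
    # Drop non-printable characters, roughly like the second sed expression
    return "".join(ch for ch in text if ch.isprintable())
```

As with the shell version, paragraphs run together because the tags that separated them are simply deleted, and XML entities such as &amp; are left undecoded.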