
How to extract just plain text from .doc & .docx files? [closed]

Can anyone recommend a tool for extracting just the plain text from a .doc or .docx file?

I've found this - wondered if there were any other suggestions?

docextract asked Apr 15 '11 03:04



People also ask

How do I extract just the text from a Word document?

Open the DOCX file and click File > Save As > Computer > Browse. Choose Plain Text as the file type (for XLSX files, choose Text (Tab delimited)). Then locate and open the text file you saved. It contains only the text from your original file, without any formatting.

How do I convert a Word document to plain text?

In Microsoft Word on Windows, choose Save As from the File menu. In the Save as type drop-down list, select Plain Text (*.txt). Click the Save button and a File Conversion window will open.

How do I extract a .DOC File?

To extract the contents of the file, right-click on the file and select “Extract All” from the popup menu. On the “Select a Destination and Extract Files” dialog box, the path where the content of the .

How do I extract text from a Word document in Python?

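To extract text from a .docx file in Python, we can use the zipfile module: create a ZipFile object with the path to the Word file, then call read with 'word/document.xml' to get the document's XML. This returns the raw XML, so the tags still have to be removed. It works only for .docx, because a legacy .doc file is not a ZIP archive.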
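The zipfile approach described above can be sketched as follows. This is a minimal illustration, not a complete converter: the function name docx_to_text is hypothetical, and it assumes the text of interest lives in w:t elements inside w:p paragraphs of word/document.xml. It ignores headers, footers, footnotes, tabs, and manual line breaks.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the text of a .docx file, one line per paragraph.

    Hypothetical helper for illustration; `path` may be a filename
    or a binary file-like object.
    """
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Each <w:t> element holds one run of visible text.
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

Calling print(docx_to_text("some.docx")) would print the paragraphs in document order. Parsing the XML rather than pattern-matching on tags means entities such as &amp;amp; are decoded by the parser.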


1 Answer

If you want just the plain text (which was my requirement), then all you need is:

unzip -p some.docx word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g' 

I found this on command line fu.

It unzips the .docx file, extracts the main document part (word/document.xml), then strips all the XML tags and any non-printable characters. Obviously, all formatting is lost.
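For anyone without unzip and sed to hand, the same tag-stripping idea can be sketched in Python. This is a rough equivalent, not a port of the one-liner: the function name strip_docx_tags is hypothetical, and it assumes word/document.xml is UTF-8 encoded. It also differs in two ways: it inserts a newline at each paragraph end so paragraphs do not run together, and it decodes XML entities such as &amp;amp;, which the sed expression leaves as they are.

```python
import html
import re
import zipfile


def strip_docx_tags(path):
    """Regex-based tag stripping, in the spirit of the unzip | sed pipeline.

    Hypothetical helper for illustration; assumes UTF-8 encoded XML.
    """
    with zipfile.ZipFile(path) as archive:
        xml_text = archive.read("word/document.xml").decode("utf-8")
    # Turn each paragraph end into a newline so paragraphs stay separate.
    xml_text = xml_text.replace("</w:p>", "\n")
    # Remove every remaining tag, like sed 's/<[^>]\{1,\}>//g'.
    text = re.sub(r"<[^>]+>", "", xml_text)
    # Decode entities such as &amp; and trim surrounding whitespace.
    return html.unescape(text).strip()
```

As with the shell version, all formatting is lost and only the main document part is read.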

rob answered Sep 17 '22 08:09
