
Convert Word doc or docx files into text files?

I need a way to convert .doc or .docx files to .txt without installing anything. I also don't want to open Word manually to do this; the conversion should run automatically.

I was thinking that either Perl or VBA could do the trick, but I can't find anything online for either.

Any suggestions?

CheeseConQueso asked Jul 10 '09 15:07



2 Answers

A simple Perl-only solution for docx:

  1. Use Archive::Zip to get the word/document.xml file from your docx file. (A docx is just a zipped archive.)

  2. Use XML::LibXML to parse it.

  3. Then use XML::LibXSLT to transform it into text or HTML format. Search the web for a suitable docx2txt.xsl stylesheet :)
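The same unzip-and-parse idea can be sketched with nothing but a standard library. The example below is in Python rather than Perl, and it swaps the XSLT step for a direct walk over the parsed tree, so treat it as an illustration of the approach and not as the Archive::Zip / XML::LibXSLT code itself. The function name `docx_to_text` is made up for this sketch. It only collects the text of `w:t` runs, one output line per `w:p` paragraph, so tabs, line breaks, tables and headers are not handled, and the older binary .doc format is not covered at all.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_text(path):
    """Return the plain text of a .docx file, one line per paragraph.

    `path` may be a filename or a binary file-like object.
    (Hypothetical helper name, introduced only for this sketch.)
    """
    # Step 1: a .docx is a zip archive; pull out the main document part.
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")

    # Step 2: parse the XML.
    root = ET.fromstring(xml_bytes)

    # Step 3 (simplified): instead of an XSLT stylesheet, join the text
    # runs (w:t) found inside each paragraph (w:p).
    paragraphs = []
    for para in root.iter("{%s}p" % W_NS):
        runs = [node.text or "" for node in para.iter("{%s}t" % W_NS)]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

Calling `docx_to_text("report.docx")` (the filename is a placeholder) returns a string that can be written straight to a .txt file.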

Cheers!

J.

jeje answered Oct 16 '22 22:10



Note that you can also use OpenOffice to perform miscellaneous document, drawing, spreadsheet, etc. conversions on both Windows and *nix platforms.

You can access OpenOffice programmatically (in a way analogous to COM on Windows) via UNO from a variety of languages for which a UNO binding exists, including from Perl via the OpenOffice::UNO module.

On the OpenOffice::UNO page you will also find a sample Perl scriptlet that opens a document. All you then need to do is export it to txt using the document.storeToURL() method; see a Python example, which can easily be adapted to your Perl needs.
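A rough Python sketch of that open-then-storeToURL() flow might look like the following. It is not the linked example; it assumes a Python 3 interpreter in which the `uno` binding is importable, an office instance already listening for UNO connections on localhost port 2002 (both values are assumptions), and that the Writer export filter for plain text is named "Text". The helper names `to_file_url` and `convert_to_txt` are invented for illustration.

```python
from pathlib import Path


def to_file_url(path):
    """Turn a local path into the file:// URL form that UNO expects."""
    return Path(path).absolute().as_uri()


def convert_to_txt(src, dst, host="localhost", port=2002):
    """Open `src` in a running office instance and export it as text.

    Assumes the office process was started so that it accepts UNO
    socket connections on host:port (assumed values, adjust as needed).
    """
    # Imported here so this module still loads where UNO is unavailable.
    import uno
    from com.sun.star.beans import PropertyValue

    def prop(name, value):
        p = PropertyValue()
        p.Name = name
        p.Value = value
        return p

    # Connect to the running office instance.
    local_ctx = uno.getComponentContext()
    resolver = local_ctx.ServiceManager.createInstanceWithContext(
        "com.sun.star.bridge.UnoUrlResolver", local_ctx)
    ctx = resolver.resolve(
        "uno:socket,host={},port={};urp;StarOffice.ComponentContext".format(
            host, port))
    desktop = ctx.ServiceManager.createInstanceWithContext(
        "com.sun.star.frame.Desktop", ctx)

    # Load the document without showing a window, then export it.
    doc = desktop.loadComponentFromURL(
        to_file_url(src), "_blank", 0, (prop("Hidden", True),))
    try:
        doc.storeToURL(to_file_url(dst), (prop("FilterName", "Text"),))
    finally:
        doc.close(True)
```

Because the office suite does the parsing, this route should also cope with the older binary .doc format, at the cost of needing OpenOffice installed and running.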

vladr answered Oct 16 '22 22:10
