I need a way to convert .doc or .docx files to .txt without installing anything. Obviously I also don't want to open Word manually to do this; it just needs to run automatically.
I was thinking that either Perl or VBA could do the trick, but I can't find anything online for either.
Any suggestions?
A simple Perl-only solution for docx:
Use Archive::Zip to get the word/document.xml file from your docx file. (A docx is just a zipped archive.)
Use XML::LibXML to parse it.
Then use XML::LibXSLT to transform it into text or HTML format. Search the web to find a nice docx2txt.xsl file :)
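To make the unzip-then-parse idea concrete, here is a minimal sketch of the same steps. Note that it is written in Python with only the standard library, not with the Perl modules named above, and that it walks the XML tree directly instead of applying an XSLT stylesheet. The function name `docx_to_text` is made up for illustration. It handles only plain paragraphs, tabs and line breaks; paragraphs nested inside text boxes may appear twice, and headers, footers and footnotes live in other files of the archive and are ignored.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by the main WordprocessingML elements (w:p, w:t, ...).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_text(path):
    """Return the body text of a .docx file, one line per paragraph.

    Hypothetical helper for illustration: reads word/document.xml from
    the zip archive and concatenates the text runs of each paragraph.
    """
    # Step 1: a .docx is a zip archive; pull out the main document part.
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")

    # Step 2: parse the XML.
    root = ET.fromstring(xml_bytes)

    # Step 3: collect text per paragraph (w:p), in document order.
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        parts = []
        for node in para.iter():
            if node.tag == f"{{{W_NS}}}t":      # a run of text
                parts.append(node.text or "")
            elif node.tag == f"{{{W_NS}}}tab":  # a tab character
                parts.append("\t")
            elif node.tag == f"{{{W_NS}}}br":   # a manual line break
                parts.append("\n")
        paragraphs.append("".join(parts))
    return "\n".join(paragraphs)
```

Calling `docx_to_text("report.docx")` (the file name is a placeholder) returns the text as a string, which can then be written to a .txt file. This approach does not cover the older binary .doc format, which is not a zip archive.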
Cheers !
J.
Note that you can also use OpenOffice to perform miscellaneous document, drawing, spreadsheet, etc. conversions on both Windows and *nix platforms.
You can access OpenOffice programmatically (in a way analogous to COM on Windows) via UNO from a variety of languages for which a UNO binding exists, including from Perl via the OpenOffice::UNO module.
On the OpenOffice::UNO page you will also find a sample Perl scriptlet which opens a document. All you then need to do is export it to txt by using the document.storeToURL() method; see a Python example, which can be easily adapted to your Perl needs.
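As a rough illustration of the open-then-storeToURL() flow, here is a minimal Python sketch. It is not the example referred to above, and it rests on several assumptions: an office instance is already running and listening for UNO connections on localhost port 2002 (typically started with an `--accept="socket,host=localhost,port=2002;urp;"` style option, written `-accept=...` on older OpenOffice versions), the script is run with a Python that can import the `uno` module shipped with the office suite, and the export filter for plain text is named "Text". The helper names `to_file_url` and `convert_to_txt` are made up for illustration.

```python
from pathlib import Path


def to_file_url(path):
    """Turn a local path into the file:// URL form that UNO expects."""
    return Path(path).resolve().as_uri()


def convert_to_txt(src, dst, host="localhost", port=2002):
    """Load src in a running office instance and export it as plain text.

    Assumes an office process is already listening on host:port.
    """
    # Imported here so the module still loads where UNO is unavailable.
    import uno
    from com.sun.star.beans import PropertyValue

    def prop(name, value):
        # UNO passes options as a sequence of name/value pairs.
        p = PropertyValue()
        p.Name = name
        p.Value = value
        return p

    # Connect to the running office instance.
    local_ctx = uno.getComponentContext()
    resolver = local_ctx.ServiceManager.createInstanceWithContext(
        "com.sun.star.bridge.UnoUrlResolver", local_ctx)
    ctx = resolver.resolve(
        f"uno:socket,host={host},port={port};urp;StarOffice.ComponentContext")
    desktop = ctx.ServiceManager.createInstanceWithContext(
        "com.sun.star.frame.Desktop", ctx)

    # Open the document without showing a window, then export it.
    document = desktop.loadComponentFromURL(
        to_file_url(src), "_blank", 0, (prop("Hidden", True),))
    try:
        # "Text" is assumed to be the plain-text export filter name.
        document.storeToURL(to_file_url(dst), (prop("FilterName", "Text"),))
    finally:
        document.close(True)
```

Unlike the unzip approach, this route also handles the binary .doc format, at the cost of needing an office installation on the machine.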