I need to break open an MS Word file (.doc) and extract its constituent files ('[1]CompObj', 'WordDocument', etc.). Something like 7-zip can be used to do this manually, but I need to do it programmatically.
I've gathered that a Word document is an OLE container (hence why 7-zip can be used to view its contents) but I can't work out how to (using C++):
I've found a couple of examples of OLE automation (e.g. here), but what I want to do seems to be less common and I've found no specific examples.
If anyone has any idea of either an API (?!) and tutorial for working with OLE I'd be grateful. Ditto any code samples.
The usual way to extract the content is to open each item individually in Excel, and save them to files. This is a tedious process if you have a lot of records you need to extract. We have 2 products, SQL Image Viewer and Access OLE Export, that can remove the OLE wrappers for data stored in OLE Object fields, and export them to disk.
However, it is difficult to extract the data from those fields because of the additional OLE information embedded together with your data. For example, let’s create a table in Access, and store a simple Excel workbook, first as an embedded object, and second as an embedded file.
olefile.OleFileIO.listdir() returns a list of all the streams contained in the OLE file, including those stored in storages. Each stream is listed itself as a list, as described above. As an option it is possible to choose if storages should also be listed, with or without streams (new in v0.26):
It is called Compound Files, part of the Structured Storage API. You start with StgOpenStorageEx(). It buys you little for a Word .doc file, though, because the streams themselves have a sophisticated binary format. To really read the document content you want to use automation, letting Word read the file. That's rarely done in C++, but that project shows you how.
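Before handing a file to StgOpenStorageEx(), it can help to confirm that it really is a compound file. The following is a minimal, portable sketch (no Windows headers) that checks the fixed 512-byte header described in the published compound file format. The names CfbHeader, parseCfbHeader, readCfbHeaderFromFile and directoryOffset are made up for this illustration and are not part of any API; the sketch only validates the header and does not walk the directory or extract streams.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper type: a few fields from the compound file header.
struct CfbHeader {
    std::uint16_t majorVersion;   // 3 (512-byte sectors) or 4 (4096-byte sectors)
    std::uint32_t sectorSize;     // 1 << sector shift
    std::uint32_t firstDirSector; // sector number where the directory starts
};

// The header fields are stored little-endian.
static std::uint16_t readU16(const unsigned char* p) {
    return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}

static std::uint32_t readU32(const unsigned char* p) {
    return static_cast<std::uint32_t>(p[0]) |
           (static_cast<std::uint32_t>(p[1]) << 8) |
           (static_cast<std::uint32_t>(p[2]) << 16) |
           (static_cast<std::uint32_t>(p[3]) << 24);
}

// Returns true and fills 'out' if 'bytes' starts with a plausible header.
bool parseCfbHeader(const std::vector<unsigned char>& bytes, CfbHeader& out) {
    static const unsigned char kMagic[8] =
        {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    if (bytes.size() < 512) return false;
    if (!std::equal(kMagic, kMagic + 8, bytes.begin())) return false;
    if (readU16(&bytes[28]) != 0xFFFE) return false;   // byte-order mark
    const std::uint16_t shift = readU16(&bytes[30]);   // sector shift
    if (shift != 9 && shift != 12) return false;
    out.majorVersion = readU16(&bytes[26]);
    out.sectorSize = 1u << shift;
    out.firstDirSector = readU32(&bytes[48]);
    return true;
}

// Sector n begins at (n + 1) * sectorSize, because the header
// occupies the space of the first sector.
std::uint64_t directoryOffset(const CfbHeader& h) {
    return (static_cast<std::uint64_t>(h.firstDirSector) + 1) * h.sectorSize;
}

// Reads the first 512 bytes of 'path' and parses them.
bool readCfbHeaderFromFile(const std::string& path, CfbHeader& out) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;
    std::vector<unsigned char> buf(512);
    in.read(reinterpret_cast<char*>(buf.data()),
            static_cast<std::streamsize>(buf.size()));
    if (in.gcount() != static_cast<std::streamsize>(buf.size())) return false;
    return parseCfbHeader(buf, out);
}
```

If readCfbHeaderFromFile() returns true for a .doc file, the file is a compound file. Enumerating and copying out the individual streams ('WordDocument' and so on) is still the job of the Structured Storage API mentioned above, or of a full parser for the format.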
This site http://www.endurasoft.com/vcd/ststo.htm contains a tutorial, API information, and a code sample that together cover everything I was looking for.