 

How to extract the contents of an OLE container?

I need to break open an MS Word file (.doc) and extract its constituent files ('[1]CompObj', 'WordDocument' etc). Something like 7-zip can be used to do this manually, but I need to do it programmatically.

I've gathered that a Word document is an OLE container (hence why 7-zip can be used to view its contents) but I can't work out how to (using C++):

  1. open the OLE container
  2. extract each constituent file and save it to disk

I've found a couple of examples of OLE automation (e.g. here), but what I want to do seems to be less common and I've found no specific examples.

If anyone knows of an API (?!) or a tutorial for working with OLE, I'd be grateful. Ditto for any code samples.

Ben L asked Jun 29 '10 14:06




2 Answers

It is called Compound Files, part of the Structured Storage API. You start with StgOpenStorageEx(). It buys you little for a Word .doc file, though: the streams themselves have a sophisticated binary format. To really read the document content you want to use automation, letting Word read the file. That's rarely done in C++, but that project shows you how.

Hans Passant answered Oct 19 '22 00:10


This site http://www.endurasoft.com/vcd/ststo.htm contains a tutorial, API information and a code sample that together do everything I was looking for.

Ben L answered Oct 18 '22 23:10