
How does the .doc format work?

Tags: format, zip, docx, doc

I recently learned about the basic structure of the .docx file (it's a specially structured zip archive). However, a .docx is not formatted like a .doc.

How does a doc file work? What is the file format, structure, etc?

stalepretzel asked Sep 24 '08 01:09




4 Answers

It's not a direct answer to your question, but I highly recommend reading Joel Spolsky's article, "Why are the Microsoft Office file formats so complicated? (And some workarounds)". It will give you some insight into how complex the .doc format really is, and why. Joel also gives a very basic overview of what the .doc format consists of:

You see, Excel 97-2003 files are OLE compound documents, which are, essentially, file systems inside a single file. These are sufficiently complicated that you have to read another 9 page spec to figure that out. And these “specs” look more like C data structures than what we traditionally think of as a spec. It's a whole hierarchical file system.

(The quote refers to Excel files, but it applies to Word documents as well.) The article is informative and helps explain why .docx and ODF files look so much more logically structured and designed when examined from the outside.
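To make the "file system inside a single file" idea a bit more concrete, here is a rough Python sketch that decodes a few fields from the 512-byte header that starts every compound file. The offsets follow my reading of Microsoft's [MS-CFB] specification; the function name `parse_cfb_header` and the dictionary keys are made up for illustration, and this is nowhere near a full parser.

```python
import struct

# Signature that opens every compound file (per the [MS-CFB] specification).
CFB_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def parse_cfb_header(header: bytes) -> dict:
    """Decode a handful of fields from the 512-byte compound file header."""
    if len(header) < 512:
        raise ValueError("a compound file header is at least 512 bytes")
    if header[:8] != CFB_SIGNATURE:
        raise ValueError("missing compound file signature")

    # Offsets 24..33: minor version, major version, byte-order mark,
    # sector shift and mini-sector shift, all little-endian uint16.
    minor, major, byte_order, sector_shift, mini_shift = struct.unpack_from(
        "<5H", header, 24
    )
    # Offsets 44 and 48: number of FAT sectors, first directory sector.
    fat_sectors, first_dir_sector = struct.unpack_from("<2I", header, 44)
    # Offset 56: streams smaller than this live in the "mini stream".
    (mini_cutoff,) = struct.unpack_from("<I", header, 56)

    return {
        "major_version": major,
        "minor_version": minor,
        "little_endian": byte_order == 0xFFFE,
        "sector_size": 1 << sector_shift,
        "mini_sector_size": 1 << mini_shift,
        "fat_sectors": fat_sectors,
        "first_directory_sector": first_dir_sector,
        "mini_stream_cutoff": mini_cutoff,
    }
```

You would feed it the first 512 bytes of a file opened in binary mode, e.g. `parse_cfb_header(open("report.doc", "rb").read(512))`, where `report.doc` is a placeholder name. The FAT and directory sectors the header points at are what make the container behave like a small file system.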

Jay answered Sep 25 '22 00:09


The full format for binary .doc files is documented in a PDF linked from the Wikipedia article on .doc.

John Millikin answered Sep 26 '22 00:09


The basic idea behind the MS Word DOC format is an OLE Compound Document which, as Kibbee has already written, is basically a memory dump. It's a very complex and convoluted way to store documents, but if you've ever really dug into Word you'll know how many features it has, and if you have used it in a business setting you'll have a good feeling for how it integrates with the other programs in the Office suite.

In general, OLE Compound Documents are very extensible structures that allow you to stuff all kinds of data into one file and, to some degree, even handle data you don't have an application installed for. For example, if you insert an Equation object (from the MS Equation Editor) into a document, it gets stored as a sub-object, which is like a file inside the file. This object doesn't just contain the data Equation Editor needs to edit and render it; it also stores a generic bitmap (or maybe metafile) representation, so it can be displayed, though not edited, on a machine without Equation Editor installed.
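As a small illustration of the "file inside the file" idea: each storage (think folder) or stream (think file) in a compound document is described by a 128-byte directory entry. The sketch below decodes one such entry in Python. The field offsets are taken from my reading of the [MS-CFB] specification; `parse_directory_entry` and `ENTRY_TYPES` are names invented here, and the code ignores the sibling/child links that tie entries into a tree.

```python
import struct

# Object type codes used in compound file directory entries.
ENTRY_TYPES = {0: "unused", 1: "storage", 2: "stream", 5: "root storage"}


def parse_directory_entry(entry: bytes) -> dict:
    """Decode the name, type, start sector and size of one directory entry."""
    if len(entry) != 128:
        raise ValueError("a directory entry is exactly 128 bytes")

    # Offset 64: name length in bytes (including the 2-byte terminator),
    # offset 66: object type.
    name_len, obj_type = struct.unpack_from("<HB", entry, 64)
    # The name is stored as UTF-16LE in the first 64 bytes.
    name = entry[: max(name_len - 2, 0)].decode("utf-16-le")

    # Offset 116: first sector of the stream, offset 120: stream size.
    start_sector, size = struct.unpack_from("<IQ", entry, 116)

    return {
        "name": name,
        "type": ENTRY_TYPES.get(obj_type, "unknown"),
        "start_sector": start_sector,
        "size": size,
    }
```

In a Word .doc, the main document content lives in a stream named "WordDocument"; an embedded object gets a storage of its own with further streams inside it.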

That was the why; for the how, you'll have to read the specifications other people have already linked to ;)

If you want an easy way to work with the files, though, make sure your software runs on a Windows machine with Word installed, then use COM/OLE Automation to open and manipulate the documents. You won't have to worry about the file format then.
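For what it's worth, the automation route might look roughly like this from Python. This is only a sketch: it assumes Windows, an installed copy of Word, and the third-party pywin32 package, and `read_doc_text` is a helper name made up for this example.

```python
import os


def read_doc_text(path: str) -> str:
    """Return the plain text of a document by driving Word over COM."""
    # Imported lazily: pywin32 only exists on Windows (assumption: installed).
    import win32com.client

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False  # keep the Word window hidden
    try:
        # Word expects an absolute path.
        doc = word.Documents.Open(os.path.abspath(path), ReadOnly=True)
        try:
            return doc.Content.Text
        finally:
            doc.Close(SaveChanges=False)
    finally:
        word.Quit()
```

Word does all the parsing here, which is why the file format stops mattering. The trade-off is that every call starts a full Word process, so this suits desktop scripts better than servers.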

jfs answered Sep 26 '22 00:09


DOC is the binary format for Word documents; see the Microsoft Office Word 97-2007 Binary File Format Specification [*.doc].
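A quick way to see the difference between the binary format and the zip-based .docx is to look at the container itself. The Python sketch below checks for the compound file signature and otherwise asks the standard library whether the data is a zip archive. It only identifies the container, not whether the content is really a Word document (.xls and .ppt use the same compound file container, for instance). `sniff_word_container` is a name made up for this example.

```python
import io
import zipfile

# Leading bytes of every OLE compound file (.doc, .xls, .ppt, ...).
CFB_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def sniff_word_container(data: bytes) -> str:
    """Classify raw file contents as 'cfb', 'zip' or 'unknown'."""
    if data.startswith(CFB_MAGIC):
        return "cfb"  # binary format, e.g. a .doc
    if zipfile.is_zipfile(io.BytesIO(data)):
        return "zip"  # e.g. a .docx, which is a zip of XML parts
    return "unknown"
```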

RWendi answered Sep 26 '22 00:09