Determine if the document is DOC or DOCX in Java app without knowing its extension

Tags: java, docx, doc

A constraint in our content management system requires all Word documents to be stored with a specific extension (different from DOC or DOCX). However, when serving a document to the user, we need to know whether it is a DOC or DOCX file in order to provide the right MIME type.

So, is there a way to programmatically determine whether a document is DOC or DOCX from its content?

Andriy asked Jun 11 '10 14:06



2 Answers

Here is a link to the ForensicsWiki which details lots of different file types. It describes the headers of both DOC and DOCX files, so you should be able to parse the files and determine what kind they are.

According to that page, .doc files are OLE Compound Files, so the file should start with the following binary header:

d0 cf 11 e0 a1 b1 1a e1

In contrast, .docx files start with the binary signature:

50 4b
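
As a rough illustration of the header check described above, here is a minimal Java sketch. The class name `WordFormatSniffer`, the `Format` enum, and the `mimeType` helper are hypothetical names introduced for this example. It assumes, as in the question, that only Word documents are stored, so any file starting with `50 4b` is treated as DOCX even though that signature matches every ZIP archive.

```java
import java.io.IOException;
import java.io.InputStream;

public class WordFormatSniffer {

    public enum Format { DOC, DOCX, UNKNOWN }

    // OLE Compound File signature used by legacy .doc files
    private static final byte[] OLE_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // ZIP signature ("PK"); assumption: a ZIP here means DOCX
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B };

    /** Reads up to 8 bytes from the stream and compares them with the known signatures. */
    public static Format detect(InputStream in) throws IOException {
        byte[] header = new byte[OLE_MAGIC.length];
        int filled = 0;
        // read() may return fewer bytes than requested, so loop until full or EOF
        while (filled < header.length) {
            int n = in.read(header, filled, header.length - filled);
            if (n < 0) {
                break;
            }
            filled += n;
        }
        if (startsWith(header, filled, OLE_MAGIC)) {
            return Format.DOC;
        }
        if (startsWith(header, filled, ZIP_MAGIC)) {
            return Format.DOCX;
        }
        return Format.UNKNOWN;
    }

    /** Maps the detected format to the MIME type to send to the user. */
    public static String mimeType(Format format) {
        switch (format) {
            case DOC:
                return "application/msword";
            case DOCX:
                return "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
            default:
                return "application/octet-stream";
        }
    }

    private static boolean startsWith(byte[] data, int length, byte[] prefix) {
        if (length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Calling `WordFormatSniffer.mimeType(WordFormatSniffer.detect(stream))` on a freshly opened stream would then give a value for the Content-Type header. The stream is consumed by the check, so it needs to be reopened (or wrapped in a `BufferedInputStream` with `mark`/`reset`) before the document is sent.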
samoz answered Sep 23 '22 02:09



DOCX files are in ZIP format, in which the first two bytes are the letters PK (after ZIP's creator, Phil Katz).
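
Because the PK signature is shared by every ZIP archive, a stricter check could look inside the archive. The following is only a sketch: the class name `DocxZipCheck` is hypothetical, and it assumes the main document part sits at the conventional path `word/document.xml`, which is what Word itself writes but is not strictly required by the format.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class DocxZipCheck {

    // Conventional location of the main document part in a DOCX package (assumption)
    private static final String MAIN_PART = "word/document.xml";

    /** Returns true if the stream is a ZIP archive containing word/document.xml. */
    public static boolean looksLikeDocx(InputStream in) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            // getNextEntry() returns null when no further ZIP entry is found
            while ((entry = zip.getNextEntry()) != null) {
                if (MAIN_PART.equals(entry.getName())) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

This walks the whole archive in the worst case, so it costs more than comparing two bytes. It is probably only worth doing if other ZIP-based files could end up in the same store.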

RichieHindle answered Sep 24 '22 02:09
