Hello, I'm trying to validate the type of an uploaded file with the finfo_file function.
But when a .docx file is sent, the detected file type is:
application/zip
instead of:
application/vnd.openxmlformats-officedocument.wordprocessingml.document
How can I change this behavior?
A DOCX file comprises a collection of XML files contained inside a ZIP archive. The contents of a new Word document can be viewed by unzipping it. The XML files in the collection fall into categories such as: metadata files, which contain information about the other files available in the archive.
Unlike DOC files, which store document data in a single binary file, DOCX files save data as separate files and folders in a compressed ZIP package. Within a DOCX file are XML files and three folders: word, docProps, and _rels, which store the content, document properties, and relationships between the files.
As far as I know, the vendor-specific file types (vnd.) are not standardized (by any RFC) and are therefore not covered by finfo_file(). A .docx is a zipped XML format, which is why finfo_file() returns application/zip (which is completely right). You could unzip the file and test the MIME type of the result, but that will yield XML (which is completely correct too) plus the other files used by the document. To tell different XML formats apart, finfo_file() would have to analyze the content and know what each format looks like, which simply goes too far.
This works on Debian. Add this to /etc/magic:
#------------------------------------------------------------------------------
# $File: msooxml,v 1.1 2011/01/25 18:36:19 christos Exp $
# msooxml: file(1) magic for Microsoft Office XML
# From: Ralf Brown <[email protected]>
# .docx, .pptx, and .xlsx are XML plus other files inside a ZIP
# archive. The first member file is normally "[Content_Types].xml".
# Since MSOOXML doesn't have anything like the uncompressed "mimetype"
# file of ePub or OpenDocument, we'll have to scan for a filename
# which can distinguish between the three types
# start by checking for ZIP local file header signature
0 string PK\003\004
# make sure the first file is correct
>0x1E string [Content_Types].xml
# skip to the second local file header
# since some documents include a 520-byte extra field following the file
# header, we need to scan for the next header
>>(18.l+49) search/2000 PK\003\004
# now skip to the *third* local file header; again, we need to scan due to a
# 520-byte extra field following the file header
>>>&26 search/1000 PK\003\004
# and check the subdirectory name to determine which type of OOXML
# file we have
>>>>&26 string word/ Microsoft Word 2007+
!:mime application/msword
>>>>&26 string ppt/ Microsoft PowerPoint 2007+
!:mime application/vnd.ms-powerpoint
>>>>&26 string xl/ Microsoft Excel 2007+
!:mime application/vnd.ms-excel
>>>>&26 default x Microsoft OOXML
!:strength +10
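Then tell PHP to use /etc/magic as its database. Note that the rule above reports a .docx as application/msword, so either compare against that value or change the !:mime lines to the type you expect: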
$finfo = finfo_open(FILEINFO_MIME,"/etc/magic");
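As a rough sketch of how the custom database might be used during validation: the helper name mimeTypeOf below is hypothetical (it is not part of PHP), and it assumes the fileinfo extension is enabled. It uses FILEINFO_MIME_TYPE rather than FILEINFO_MIME so the result is the bare type without the "; charset=..." suffix, which makes it easier to compare against a whitelist.

```php
<?php
// Hypothetical helper, not part of PHP: returns the bare MIME type of $path,
// or null when the magic database or the file cannot be read.
function mimeTypeOf(string $path, ?string $magicFile = null): ?string
{
    // FILEINFO_MIME_TYPE omits the "; charset=..." part that FILEINFO_MIME appends.
    $finfo = $magicFile === null
        ? finfo_open(FILEINFO_MIME_TYPE)              // PHP's default database
        : finfo_open(FILEINFO_MIME_TYPE, $magicFile); // e.g. "/etc/magic"

    if ($finfo === false) {
        return null; // the database could not be loaded
    }

    $type = finfo_file($finfo, $path);

    return $type === false ? null : $type;
}
```

With the rule above installed, a call such as mimeTypeOf($tmpName, '/etc/magic') would be expected to return application/msword for a Word 2007+ upload, where $tmpName stands for the temporary path of the uploaded file.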
This is because a DOCX is a ZIP file:
An Office Open XML file is a ZIP-compatible OPC package containing XML documents and other resources.
Like OpenOffice files, the documents are ZIPs containing various resources in a structured and well-defined manner. So when you try to identify the file content, you first see that it is a ZIP file. You would then need to look inside the ZIP to decide whether it's a DOCX or an OpenOffice file.
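Looking inside the ZIP could be sketched roughly as follows. The function name detectOoxmlMimeType is made up for illustration, the code assumes PHP's zip extension is available, and it assumes the conventional part names that Word, PowerPoint, and Excel normally write (word/document.xml, ppt/presentation.xml, xl/workbook.xml). A package that names its main part differently would not be recognized.

```php
<?php
// Hypothetical helper: maps a ZIP-based file to an OOXML MIME type by checking
// which conventional main part it contains. Returns null for anything else.
function detectOoxmlMimeType(string $path): ?string
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return null; // not a readable ZIP archive
    }

    // Assumed part names; these are the usual defaults, not a guarantee.
    $mainParts = [
        'word/document.xml'    => 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
        'ppt/presentation.xml' => 'application/vnd.openxmlformats-officedocument.presentationml.presentation',
        'xl/workbook.xml'      => 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
    ];

    $mime = null;
    // locateName() returns an index (possibly 0) or false, so compare strictly.
    if ($zip->locateName('[Content_Types].xml') !== false) {
        foreach ($mainParts as $part => $type) {
            if ($zip->locateName($part) !== false) {
                $mime = $type;
                break;
            }
        }
    }

    $zip->close();

    return $mime;
}
```

This only checks that the expected entries exist. It does not validate the XML inside them, so it should be treated as a plausibility check rather than proof that the file is a well-formed document.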
As an alternative, you could have a look at the file extension: if you identify the file to be a ZIP and the extension happens to be .doc or .docx, then you can assume it to be an OOXML file.
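The extension heuristic might look like the sketch below. The function name looksLikeWordUpload is hypothetical, and $detectedMime is assumed to be the bare type reported by finfo (without a charset suffix). Keep in mind that the original file name is supplied by the client, so this is a weak check on its own.

```php
<?php
// Hypothetical helper: treats a file detected as a ZIP as a Word document
// when the client-supplied name ends in .doc or .docx.
function looksLikeWordUpload(string $detectedMime, string $originalName): bool
{
    // pathinfo() returns the part after the last dot; compare case-insensitively.
    $extension = strtolower(pathinfo($originalName, PATHINFO_EXTENSION));

    return $detectedMime === 'application/zip'
        && in_array($extension, ['doc', 'docx'], true);
}
```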
See my answer in this thread:
Overview
PHP uses libmagic. When Magic detects the MIME type as "application/zip" instead of "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", this is because the files added to the ZIP archive need to be in a certain order.
This causes a problem when uploading files to services that enforce a match between file extension and MIME type. For example, MediaWiki-based wikis (written in PHP) block certain XLSX files from being uploaded because they are detected as ZIP files.
What you need to do is fix your XLSX by reordering the files written to the ZIP archive so that Magic can detect the MIME type properly.
...
The post continues to analyze the file and develop a solution by rewriting the file.
Here is the file list for a DOCX file created using Word.
$ unzip -l Word.docx
Archive:  Word.docx
  Length      Date    Time    Name
---------  ---------- -----   ----
     1364  1980-01-01 00:00   [Content_Types].xml
      734  1980-01-01 00:00   _rels/.rels
      817  1980-01-01 00:00   word/_rels/document.xml.rels
     1823  1980-01-01 00:00   word/document.xml
     6799  1980-01-01 00:00   word/theme/theme1.xml
     2068  1980-01-01 00:00   docProps/thumbnail.emf
     2652  1980-01-01 00:00   word/settings.xml
     1954  1980-01-01 00:00   word/fontTable.xml
      576  1980-01-01 00:00   word/webSettings.xml
      735  1980-01-01 00:00   docProps/core.xml
    28979  1980-01-01 00:00   word/styles.xml
      709  1980-01-01 00:00   docProps/app.xml
---------                     -------
    49210                     12 files
You may have to imitate that file order, or try writing the "[Content_Types].xml", "word/document.xml", and "word/styles.xml" files before any other files.
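One way to try the second suggestion is to copy the archive into a new one, writing the named parts first. This is only a sketch: the function name reorderDocx is invented here, it assumes PHP's zip extension, and whether the result is then detected correctly depends on the magic rules in use. It also recompresses every entry and skips bare directory entries.

```php
<?php
// Hypothetical helper: copies $source into a new archive at $target, writing
// the entries listed in $first before all others. $target must differ from $source.
function reorderDocx(string $source, string $target, array $first = [
    '[Content_Types].xml',
    'word/document.xml',
    'word/styles.xml',
]): bool {
    $in = new ZipArchive();
    if ($in->open($source) !== true) {
        return false;
    }

    $out = new ZipArchive();
    if ($out->open($target, ZipArchive::CREATE | ZipArchive::OVERWRITE) !== true) {
        $in->close();
        return false;
    }

    // Collect the entry names in their current order.
    $names = [];
    for ($i = 0; $i < $in->numFiles; $i++) {
        $names[] = $in->getNameIndex($i);
    }

    // Preferred parts first (only those present), then the rest in original order.
    $ordered = array_merge(array_intersect($first, $names), array_diff($names, $first));

    foreach ($ordered as $name) {
        if (substr($name, -1) === '/') {
            continue; // skip bare directory entries
        }
        $data = $in->getFromName($name);
        if ($data === false || !$out->addFromString($name, $data)) {
            $in->close();
            $out->close();
            return false;
        }
    }

    $in->close();

    return $out->close(); // the new archive is written on close()
}
```

After rewriting, running unzip -l on the new file should show the three named parts at the top of the listing.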