I have .doc, .docx, .xls, .xlsx, and .pdf files on my server.
Is it possible (and if so, how) to extract the metadata from those files using PHP? I'm looking for things like author, keywords, title, etc.
In Office documents, it's the information stored with the document properties (File > Properties > Summary in Office 2003, Prepare > Properties in Office 2007).
In PDFs it's information found in Document Properties.
This is not on a Windows server.
How to view PDF metadata? Open the PDF document in Adobe Acrobat and go to File > Properties > Description. The window that opens shows the document's metadata fields.
At the time of this writing, MetaGooFil was capable of extracting metadata from the following formats: pdf, doc, xls, ppt, odp, ods, docx, xlsx, and pptx. You can enter multiple file types by separating each type with a comma (but no spaces).
I managed to extract a lot of metadata using XPDF on a Linux system a few years back. Nowadays, though, I would say Zend_PDF is your best bet. I haven't used it myself, but it looks good and promises everything you need. It seems to have no library dependencies, either.
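For the XPDF route, a minimal sketch could shell out to the `pdfinfo` tool (shipped with Xpdf and Poppler) and parse its "Key: value" output. This assumes `pdfinfo` is installed and on the PATH. The function names `parsePdfinfoOutput` and `readPdfMetadata` are made up for illustration and are not part of any library.

```php
<?php
// Sketch only: read PDF metadata by calling the external `pdfinfo` command.
// Assumption: `pdfinfo` (Xpdf / Poppler) is installed and on the PATH.
// Both function names are hypothetical helpers.

/**
 * Turn pdfinfo's "Key:   value" lines into an associative array.
 */
function parsePdfinfoOutput(string $output): array
{
    $meta = [];
    foreach (preg_split('/\R/', $output) as $line) {
        // Split on the first colon only, since values such as dates contain colons.
        $parts = explode(':', $line, 2);
        if (count($parts) === 2 && trim($parts[0]) !== '') {
            $meta[trim($parts[0])] = trim($parts[1]);
        }
    }
    return $meta;
}

/**
 * Run pdfinfo on a file; returns null if the command exits with an error.
 */
function readPdfMetadata(string $path): ?array
{
    $lines = [];
    $exitCode = 0;
    exec('pdfinfo ' . escapeshellarg($path) . ' 2>/dev/null', $lines, $exitCode);
    if ($exitCode !== 0) {
        return null;
    }
    return parsePdfinfoOutput(implode("\n", $lines));
}
```

Calling `readPdfMetadata('/path/to/file.pdf')` should then give keys such as `Title`, `Author`, and `Keywords` when the PDF defines them. Fields the document does not set are simply absent from the array.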
For Word .doc files, if you don't find a better way, plug into an OpenOffice server instance or its command line and convert the files to ODT, which is XML and parseable. That is the fallback if the metadata can't be extracted directly with a macro; it should be possible, but I don't know how much work it is. This OpenOffice Forum entry gives a ton of starting points for automated conversion.
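As a rough sketch of the convert-then-parse idea: the code below builds a headless conversion command and reads the metadata from an ODT's `meta.xml`. It assumes a `soffice` binary that accepts the LibreOffice-style `--headless --convert-to` options; older OpenOffice builds may need a different invocation. It also assumes the zip and SimpleXML extensions are available. All function names are hypothetical.

```php
<?php
// Sketch only: convert .doc to .odt, then read meta.xml from the ODT (a ZIP archive).
// Assumption: `soffice` supports the LibreOffice-style --headless/--convert-to options.
// All function names are hypothetical helpers.

/** Build the shell command that converts $source to ODT inside $outDir. */
function buildConvertCommand(string $source, string $outDir): string
{
    return 'soffice --headless --convert-to odt --outdir '
        . escapeshellarg($outDir) . ' ' . escapeshellarg($source);
}

/** Extract a few common fields from the contents of an ODF meta.xml. */
function parseOdfMeta(string $xml): array
{
    $doc = simplexml_load_string($xml);
    if ($doc === false) {
        return [];
    }
    $office = $doc->children('urn:oasis:names:tc:opendocument:xmlns:office:1.0');
    if (!isset($office->meta)) {
        return [];
    }
    $dc   = $office->meta->children('http://purl.org/dc/elements/1.1/');
    $meta = $office->meta->children('urn:oasis:names:tc:opendocument:xmlns:meta:1.0');

    // ODF stores each keyword in its own <meta:keyword> element.
    $keywords = [];
    foreach ($meta->keyword as $keyword) {
        $keywords[] = (string) $keyword;
    }
    $initialCreator = $meta->{'initial-creator'};
    return [
        'title'    => (string) $dc->title,
        'author'   => (string) $initialCreator,
        'keywords' => $keywords,
    ];
}

/** Read and parse meta.xml from an .odt file; null if it cannot be read. */
function readOdtMetadata(string $odtPath): ?array
{
    $zip = new ZipArchive();
    if ($zip->open($odtPath) !== true) {
        return null;
    }
    $xml = $zip->getFromName('meta.xml');
    $zip->close();
    return $xml === false ? null : parseOdfMeta($xml);
}
```

A typical flow would run `exec(buildConvertCommand($doc, $tmpDir))` and then call `readOdtMetadata()` on the converted file. Whether the conversion filter carries every property across is something to verify on your own documents.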
The ...X formats (.docx, .xlsx) are XML-based (ZIP packages of XML files), so it should be easy to fetch the metadata from them. Alternatively, you should be able to use OpenOffice's conversion filters here as well, if they preserve the metadata.
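For .docx and .xlsx, a minimal sketch is to open the file as a ZIP archive and read `docProps/core.xml`, where Office Open XML packages keep title, creator, and keywords. It assumes the zip and SimpleXML extensions are enabled. `parseCoreProperties` and `readOoxmlMetadata` are hypothetical names.

```php
<?php
// Sketch only: read core properties from a .docx / .xlsx package.
// Assumption: the PHP zip and SimpleXML extensions are enabled.
// Both function names are hypothetical helpers.

/** Extract common fields from the contents of docProps/core.xml. */
function parseCoreProperties(string $xml): array
{
    $doc = simplexml_load_string($xml);
    if ($doc === false) {
        return [];
    }
    // Title, creator and subject use Dublin Core; keywords use the package namespace.
    $dc = $doc->children('http://purl.org/dc/elements/1.1/');
    $cp = $doc->children('http://schemas.openxmlformats.org/package/2006/metadata/core-properties');
    return [
        'title'    => (string) $dc->title,
        'author'   => (string) $dc->creator,
        'subject'  => (string) $dc->subject,
        'keywords' => (string) $cp->keywords,
    ];
}

/** Open the package and parse its core properties; null if unreadable. */
function readOoxmlMetadata(string $path): ?array
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return null;
    }
    $xml = $zip->getFromName('docProps/core.xml');
    $zip->close();
    return $xml === false ? null : parseCoreProperties($xml);
}
```

Properties that the document does not set come back as empty strings. This does not cover the legacy binary .doc and .xls formats.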