
Is it possible to extract Meta information from MS office files and/or PDFs with PHP?

So I have files....

.doc
.docx
.xls
.xlsx
and .pdf

that are on my server.

Is it possible (and if it is, how) to extract the metadata from those files using PHP? I'm looking for things like author, keywords, title, etc.

In office documents it's the information stored along with the document properties (File...Properties...Summary for 2003, Prepare...Properties for 2007).

In PDFs it's information found in Document Properties.

This is not on a Windows server.

Jason asked Jan 19 '10 18:01



1 Answer

A few years back I managed to extract a lot of metadata using XPDF on a Linux system. Nowadays, though, I would say Zend_PDF is your best bet. I haven't used it myself, but it looks good and promises everything you need. It also seems to have no library dependencies.
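As a rough sketch of the XPDF route: the XPDF tools include a `pdfinfo` command that prints one `Key: value` line per property (Title, Author, Keywords and so on). The code below assumes `pdfinfo` is installed and on the PATH; the function names `parsePdfInfoOutput` and `pdfMetadata` are made up for illustration.

```php
<?php
// Parse the "Key:   value" lines printed by pdfinfo into an associative array.
function parsePdfInfoOutput(string $output): array
{
    $meta = [];
    foreach (preg_split('/\R/', $output) as $line) {
        // Split on the first colon only, because values such as dates contain colons.
        $parts = explode(':', $line, 2);
        if (count($parts) === 2 && trim($parts[0]) !== '') {
            $meta[trim($parts[0])] = trim($parts[1]);
        }
    }
    return $meta;
}

// Hypothetical wrapper: assumes the pdfinfo binary is available on the PATH.
function pdfMetadata(string $path): array
{
    // escapeshellarg() keeps odd file names from breaking the command line.
    $output = shell_exec('pdfinfo ' . escapeshellarg($path) . ' 2>/dev/null');
    return is_string($output) ? parsePdfInfoOutput($output) : [];
}
```

Calling `pdfMetadata('/path/to/file.pdf')` would then give an array you can read with keys such as `'Author'` or `'Title'`, provided the PDF actually sets them.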

For Word .doc files, if you don't find a better way, plug into an OpenOffice server instance or its command line and convert the files to ODT, which is XML-based and parseable. That is the fallback if extracting the metadata with a macro turns out not to be possible; it should be possible, but I don't know how much work it is. This OpenOffice Forum entry gives a ton of starting points for automated conversion.
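Once a file has been converted, reading the properties could look roughly like this. An ODT file is a ZIP archive whose `meta.xml` holds the document properties. The sketch assumes the conversion has already happened and that it carried the metadata over; `parseOdtMeta` and `odtMetadata` are hypothetical names, and the ZIP and SimpleXML extensions need to be enabled.

```php
<?php
// Pull common properties out of the XML text of an ODT meta.xml file.
function parseOdtMeta(string $xml): array
{
    $doc = simplexml_load_string($xml);
    if ($doc === false) {
        return [];
    }
    $officeNs = 'urn:oasis:names:tc:opendocument:xmlns:office:1.0';
    $metaNs   = 'urn:oasis:names:tc:opendocument:xmlns:meta:1.0';
    $dcNs     = 'http://purl.org/dc/elements/1.1/';

    $office = $doc->children($officeNs);
    if (!isset($office->meta)) {
        return [];
    }
    $block  = $office->meta;
    $result = [];

    // Dublin Core elements: title, subject, description, last editor.
    $dc = $block->children($dcNs);
    foreach (['title', 'subject', 'description', 'creator'] as $name) {
        if (isset($dc->$name)) {
            $result[$name] = (string) $dc->$name;
        }
    }

    // ODF-specific elements: original author and one element per keyword.
    $odf = $block->children($metaNs);
    if (isset($odf->{'initial-creator'})) {
        $result['initial-creator'] = (string) $odf->{'initial-creator'};
    }
    foreach ($odf->keyword as $keyword) {
        $result['keywords'][] = (string) $keyword;
    }
    return $result;
}

// Hypothetical wrapper: reads meta.xml straight out of the .odt archive.
function odtMetadata(string $path): array
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return [];
    }
    $xml = $zip->getFromName('meta.xml');
    $zip->close();
    return $xml === false ? [] : parseOdtMeta($xml);
}
```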

The ...X formats (.docx, .xlsx) are XML-based (each file is a ZIP package of XML parts), so it should be easy to fetch the metadata from them. Alternatively, you should be able to use OpenOffice's conversion filters here as well, if they transport the metadata.
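A minimal sketch of reading the ...X formats directly: in these packages the title, author and keywords live in the `docProps/core.xml` part. The function names `parseCoreProperties` and `officeMetadata` are made up for illustration, and the code assumes the ZIP and SimpleXML extensions are enabled.

```php
<?php
// Extract common properties from the XML text of docProps/core.xml.
function parseCoreProperties(string $xml): array
{
    $doc = simplexml_load_string($xml);
    if ($doc === false) {
        return [];
    }
    // Namespace URI => element names to look for in that namespace.
    $wanted = [
        'http://purl.org/dc/elements/1.1/'
            => ['title', 'creator', 'subject', 'description'],
        'http://schemas.openxmlformats.org/package/2006/metadata/core-properties'
            => ['keywords', 'lastModifiedBy'],
    ];
    $meta = [];
    foreach ($wanted as $namespace => $names) {
        $children = $doc->children($namespace);
        foreach ($names as $name) {
            if (isset($children->$name)) {
                $meta[$name] = (string) $children->$name;
            }
        }
    }
    return $meta;
}

// Hypothetical wrapper: opens a .docx or .xlsx and parses its core properties.
function officeMetadata(string $path): array
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return [];
    }
    $xml = $zip->getFromName('docProps/core.xml');
    $zip->close();
    return $xml === false ? [] : parseCoreProperties($xml);
}
```

Note that the author is stored as `creator` here, so `officeMetadata($path)['creator']` is what corresponds to the Author field in the Office properties dialog.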

Pekka answered Oct 25 '22 08:10