I'd like to detect if a user accidentally uploads an Excel file marked as .csv. Is there a standard binary footprint for xls files that would make this possible?
You can read Excel files in Python:
http://scienceoss.com/read-excel-files-from-python/
You can read Excel files in Perl:
http://www.thegeekstuff.com/2011/12/perl-and-excel/
How can I read Excel files in Perl?
The Unix/Linux file utility can recognize Excel files and a large number of other formats.
Sample output:
file ~/Downloads/*xls
/home/paul/Downloads/REDACTED1.xls: Composite Document File V2 Document, Little Endian, Os: Windows, Version 5.1, Code page: 1252, Author: Someones Name, Last Saved By: Somebody Else, Name of Creating Application: Microsoft Excel, Create Time/Date: Wed Jan 27 00:39:46 2010, Last Saved Time/Date: Sun Feb 28 13:55:47 2010, Security: 0
/home/paul/Downloads/REDACTED2.xls: Composite Document File V2 Document, Little Endian, Os: Windows, Version 1.0, Code page: -535, Author: Paul , Last Saved By: Paul , Revision Number: 3, Total Editing Time: 18:09, Create Time/Date: Wed Oct 26 23:45:51 2011, Last Saved Time/Date: Thu Oct 27 00:34:42 2011
You could simply build a library that calls the file utility and returns the result.
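A minimal sketch of such a wrapper in Python, assuming the file utility is installed and on the PATH. The function names describe_with_file and looks_like_cdf are made up for illustration. Note that "Composite Document File" identifies the container format shared by older Microsoft Office documents, so a match means "not a plain CSV" rather than "definitely Excel".

```python
import subprocess


def describe_with_file(path):
    """Return the one-line description printed by `file --brief PATH`.

    Hypothetical helper: it assumes the `file` utility is installed.
    """
    result = subprocess.run(
        ["file", "--brief", path],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if `file` exits non-zero
    )
    return result.stdout.strip()


def looks_like_cdf(description):
    """True if a `file` description names the Composite Document File format.

    This is the container used by legacy .xls files (and also by .doc and
    .ppt), as seen in the sample output above.
    """
    return "Composite Document File" in description
```

For an upload named data.csv, looks_like_cdf(describe_with_file("data.csv")) returning True would suggest that the file is really an old-style Office document with the wrong extension.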
To see how the file utility does it, its source code is available, and it has its own configuration file and even a configuration directory of magic byte and string info.
apt-get source file
./file-5.11/magic/Magdir is a directory full of magic bytes and strings to look for in a large variety of formats, but the "Composite Document File" string seen in the scan of my own Excel files was not declared there. This directory does have definition files for Excel on Mac, Word, and some old MS-DOS formats.
cd ./file-5.11; grep 'Composite Document File' */*
yields:
src/cdf.c: * Parse Composite Document Files, the format used in Microsoft Office
src/cdf.c: * N.B. This is the "Composite Document File" format, and not the
src/cdf.h: * Parse Composite Document Files, the format used in Microsoft Office
src/cdf.h: * N.B. This is the "Composite Document File" format, and not the
src/readcdf.c: if (file_printf(ms, "Composite Document File V2 Document")
src/readcdf.c: if (file_printf(ms, "Composite Document File V2 Document")
These source files are where I would suggest looking to determine how the file utility is able to detect some of the Microsoft Excel formats.
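If calling an external program is not an option, the same idea can be sketched directly in Python by checking the first bytes of the upload. This is only a sketch under stated assumptions: the 8-byte signature D0 CF 11 E0 A1 B1 1A E1 is the header of the Composite Document File (OLE2) container used by legacy .xls files, and the PK\x03\x04 check covers ZIP archives, which is the container for newer .xlsx files. Neither signature is unique to Excel (Word documents and ordinary ZIP files match too), and the function name sniff_container is made up for illustration.

```python
# Signature of the Composite Document File (OLE2) container.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# Local file header signature of a ZIP archive (used by .xlsx).
ZIP_MAGIC = b"PK\x03\x04"


def sniff_container(path):
    """Return "ole2", "zip", or None based on the file's leading bytes.

    Hypothetical helper: a non-None result means the file is a binary
    container and therefore not a plain-text CSV.
    """
    with open(path, "rb") as fh:
        head = fh.read(8)
    if head.startswith(OLE2_MAGIC):
        return "ole2"
    if head.startswith(ZIP_MAGIC):
        return "zip"
    return None
```

A real CSV is plain text, so it should return None here; anything else can be rejected or routed to an Excel reader.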