 

Detect if .csv file is actually .xls (Excel) file

I'd like to detect if a user accidentally uploads an Excel file marked as .csv. Is there a standard binary footprint for xls files that would make this possible?

Rafael asked Nov 04 '22 17:11



1 Answer

You can read Excel files in Python:

http://scienceoss.com/read-excel-files-from-python/

You can read Excel files in Perl:

http://www.thegeekstuff.com/2011/12/perl-and-excel/

See also the related question "How can I read Excel files in Perl?"

The Unix/Linux utility file can recognize Excel files and a large number of other formats.

Sample output:

file ~/Downloads/*xls

/home/paul/Downloads/REDACTED1.xls:          Composite Document File V2 Document, Little Endian, Os: Windows, Version 5.1, Code page: 1252, Author: Someones Name, Last Saved By: Somebody Else, Name of Creating Application: Microsoft Excel, Create Time/Date: Wed Jan 27 00:39:46 2010, Last Saved Time/Date: Sun Feb 28 13:55:47 2010, Security: 0

/home/paul/Downloads/REDACTED2.xls: Composite Document File V2 Document, Little Endian, Os: Windows, Version 1.0, Code page: -535, Author: Paul , Last Saved By: Paul , Revision Number: 3, Total Editing Time: 18:09, Create Time/Date: Wed Oct 26 23:45:51 2011, Last Saved Time/Date: Thu Oct 27 00:34:42 2011

You could simply build a library that calls file and returns the result.
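As a rough illustration of that wrapper idea, here is a minimal Python sketch. It assumes the file utility is installed and on PATH. The function names describe_file and description_suggests_excel are made up for this example, and the marker strings are simply taken from the sample output above. Note that "Composite Document File" also describes other legacy Office documents (Word .doc, for instance), so a match means "this is not a plain-text CSV" rather than "this is definitely Excel".

```python
import subprocess


def describe_file(path):
    """Return the one-line description that `file --brief` prints for path.

    Assumes the `file` utility is available on PATH.
    """
    result = subprocess.run(
        ["file", "--brief", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def description_suggests_excel(description):
    """Heuristic: True if the description looks like a legacy Office container.

    The markers come from the sample output in this answer; they are not an
    exhaustive list of what `file` may print for Excel files.
    """
    markers = ("Composite Document File", "Microsoft Excel")
    return any(marker in description for marker in markers)
```

A caller could then reject an upload with something like description_suggests_excel(describe_file(upload_path)), where upload_path is wherever your application stored the uploaded file.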

To see how file does it, you can read its source code. The utility also has its own configuration file and even a configuration directory of magic byte and string definitions.

apt-get source file

./file-5.11/magic/Magdir is a directory full of magic bytes and strings to look for in a large variety of formats, but the "Composite Document File" string seen in the scan of my own Excel files was not declared there. This directory does have definition files for Excel on Mac, for Word, and for some old MS-DOS formats.

cd ./file-5.11; grep 'Composite Document File' */*

yields:

src/cdf.c: * Parse Composite Document Files, the format used in Microsoft Office
src/cdf.c: * N.B. This is the "Composite Document File" format, and not the
src/cdf.h: * Parse Composite Document Files, the format used in Microsoft Office
src/cdf.h: * N.B. This is the "Composite Document File" format, and not the
src/readcdf.c:                if (file_printf(ms, "Composite Document File V2 Document")
src/readcdf.c:          if (file_printf(ms, "Composite Document File V2 Document")

I would suggest investigating these source files (src/cdf.c, src/cdf.h and src/readcdf.c) to determine how the file utility detects some of the Microsoft Excel formats.
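For the upload check in the question, shelling out may not even be necessary. The Composite Document File container (also known as OLE2 or Compound File Binary) that file reports for .xls begins with the fixed 8-byte signature D0 CF 11 E0 A1 B1 1A E1. The newer .xlsx format is not mentioned above, but it is a ZIP archive and so normally begins with PK\x03\x04. The Python sketch below checks only those leading bytes. The names sniff_header and sniff_path are hypothetical, and a match identifies the container only, not Excel specifically: a Word .doc or an arbitrary ZIP file would match as well.

```python
# Leading bytes of a Compound File Binary (OLE2) container, used by .xls.
CFB_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")
# Local file header of a ZIP archive, the container used by .xlsx.
ZIP_SIGNATURE = b"PK\x03\x04"


def sniff_header(header):
    """Classify the first bytes of a file.

    Returns "cfb" for a Compound File Binary container, "zip" for a ZIP
    archive, or None when neither signature matches (as for a real CSV).
    """
    if header.startswith(CFB_SIGNATURE):
        return "cfb"
    if header.startswith(ZIP_SIGNATURE):
        return "zip"
    return None


def sniff_path(path):
    """Read the first 8 bytes of the file at path and classify them."""
    with open(path, "rb") as handle:
        return sniff_header(handle.read(8))
```

Anything other than None for a file uploaded as .csv is a strong hint that the user picked a spreadsheet binary by mistake.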

Paul answered Dec 06 '22 19:12
