
Correct way to distinguish .xls from .doc file?

I searched for how to detect whether a file is an .xls file and found a solution like this (although I would prefer one that is not deprecated):
POIFSFileSystem:

@Deprecated
@Removal(version="4.0")
public static boolean hasPOIFSHeader(InputStream inp) throws IOException {
    return FileMagic.valueOf(inp) == FileMagic.OLE2;
}

But this also returns true for Microsoft Word documents, for example .doc files.

Is there a way to detect an .xls document specifically?

gstackoverflow asked Mar 09 '23 00:03


2 Answers

Both .doc and .xls documents are stored in the OLE2 storage format. org.apache.poi.poifs.filesystem.FileMagic only detects the storage format of a file, so on its own it is not sufficient to distinguish between .doc and .xls files.

Also, there does not appear to be any direct API in the POI library to determine the document type (Excel or Word) for a given input stream or file.
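To make the first point concrete, here is a minimal sketch in plain Java (no POI) of what a storage-format check amounts to: comparing the first 8 bytes against the OLE2 signature. The class and method names are hypothetical and only for illustration. Since .doc and .xls files both start with this signature, a check like this cannot tell them apart.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; this is not part of Apache POI.
class Ole2HeaderCheck {
    // The 8-byte signature found at the start of an OLE2 compound file.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Returns true if the given leading bytes start with the OLE2 signature.
    static boolean hasOle2Header(byte[] head) {
        if (head == null || head.length < OLE2_MAGIC.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(head, OLE2_MAGIC.length), OLE2_MAGIC);
    }

    public static void main(String[] args) {
        // Fake leading bytes: the OLE2 signature followed by two padding bytes.
        byte[] head = Arrays.copyOf(OLE2_MAGIC, 10);
        System.out.println("OLE2 header: " + hasOle2Header(head));
    }
}
```

A .doc and an .xls file would both pass this check, which is why the content inside the container has to be inspected as well.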

The example below may be helpful for determining whether a given stream is a valid .xls (or .xlsx) file, with the caveat that it reads the given input stream and closes it.

    // Reads all content from the given input stream and closes it
    public static boolean isExcelFile(InputStream in) throws IOException {
        try {
            // WorkbookFactory consumes the whole input stream here
            Workbook workbook = org.apache.poi.ss.usermodel.WorkbookFactory.create(in);
            workbook.close();
            return true;

        } catch (java.lang.IllegalArgumentException | org.apache.poi.openxml4j.exceptions.InvalidFormatException e) {
            // the content could not be opened as an Excel workbook
            return false;
        }
    }
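Because the stream is consumed by the check, one possible workaround is to copy it into memory first and hand out a fresh stream for each use. The following is a minimal sketch under that assumption (the file must fit in memory); ReusableInput and its methods are hypothetical names, not part of POI.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical helper for illustration: buffers a stream so it can be read more than once.
class ReusableInput {
    private final byte[] data;

    private ReusableInput(byte[] data) {
        this.data = data;
    }

    // Reads the whole stream into memory and closes it.
    static ReusableInput of(InputStream in) {
        try (InputStream src = in) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = src.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return new ReusableInput(buffer.toByteArray());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Each call returns an independent stream over the same bytes.
    ByteArrayInputStream newStream() {
        return new ByteArrayInputStream(data);
    }

    int size() {
        return data.length;
    }
}
```

With this in place, something like isExcelFile(copy.newStream()) could be followed by a second copy.newStream() for the actual processing, at the cost of holding the whole file in memory.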

You may find more information on the Excel file format at this link.

Update: a solution based on Apache Tika, as suggested by gagravarr:

import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.Tika;
import org.apache.tika.io.TemporaryResources;
import org.apache.tika.io.TikaInputStream;

public class TikaBasedFileTypeDetector {
    private Tika tika;
    private TemporaryResources temporaryResources;

    public void init() {
        this.tika = new Tika();
        this.temporaryResources = new TemporaryResources();
    }

    // clean up all the temporary resources
    public void destroy() throws IOException {
        temporaryResources.close();
    }

    // return the MIME type of the content
    public String detectType(InputStream in) throws IOException {
        TikaInputStream tikaInputStream = TikaInputStream.get(in, temporaryResources);

        return tika.detect(tikaInputStream);
    }

    public boolean isExcelFile(InputStream in) throws IOException{
        // see https://stackoverflow.com/a/4212908/1700467 for information on mimetypes
        String type = detectType(in);
        return type.startsWith("application/vnd.ms-excel") || // legacy Microsoft Excel formats such as .xls
                type.startsWith("application/vnd.openxmlformats-officedocument.spreadsheetml"); // Office Open XML spreadsheets such as .xlsx
    }
}

See this answer on mime types.

skadya answered Mar 16 '23 22:03


You can work with Apache POI's HSSF module.
That module is written to read and write .xls files (POI also supports .xlsx, although that is a different format).
With this code...

InputStream excelFileToRead = new FileInputStream("FileNameWithLink.xls");
// the constructor throws an exception if the stream is not a readable .xls workbook
HSSFWorkbook wb = new HSSFWorkbook(excelFileToRead);
HSSFSheet sheet = wb.getSheetAt(0);

...you can detect whether it is a readable .xls file.
Going further, you can use the same code to actually read the contents. The module is really easy to use.
There can be situations where a file is technically an .xls file but is still not readable (various things can be wrong with it).
Extra: XSSF is for .xlsx and HSSF is for .xls.

I haven't used other techniques, as I always want to be sure that I will be able to read the file later.

Mike B answered Mar 16 '23 22:03