
How to detect additional mime type in Golang

The net/http package provides the http.DetectContentType([]byte) function, but it supports only a limited number of types. How can I add support for docx, doc, xls, xlsx, ppt, pps, odt, ods, and odp files, detecting them by content rather than by extension? As far as I know, this is problematic because docx/xlsx/pptx/odp/odt files have the same signature as a zip file (50 4B 03 04).
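
To make the limitation concrete, here is a minimal sketch that feeds the ZIP signature quoted above to http.DetectContentType. The helper name sniff is hypothetical and exists only for this illustration.

```go
package main

import (
	"fmt"
	"net/http"
)

// sniff (hypothetical helper) returns what net/http reports for the
// given leading bytes of a file.
func sniff(head []byte) string {
	return http.DetectContentType(head)
}

func main() {
	// Signature shared by zip, docx, xlsx, pptx, odt, odp, ...
	zipMagic := []byte{0x50, 0x4B, 0x03, 0x04}
	fmt.Println(sniff(zipMagic)) // application/zip
}
```

Any ZIP-based office file is reported the same way, since the function only looks at the leading bytes.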

Kokizzu asked Dec 20 '22 04:12


2 Answers

Disclaimer: I'm the author of mimetype.

For anyone having the same problem three years later, these are the current packages for content-based MIME type detection:

  • filetype

    • pure Go, no C bindings
    • can be extended to detect new MIME types
    • has issues with files that match more than one MIME type (e.g. xlsx and docx passing as zip), because it stores matching functions in a map and so does not guarantee the order of traversal
    • limited number of detected MIME types
  • magicmime

    • needs libmagic-dev installed
    • of the three, it detects the highest number of MIME types
    • can be extended, albeit with more effort (see man magic)
    • libmagic is not thread-safe
  • mimetype

    • pure Go, no C bindings
    • higher number of detected MIME types than filetype
    • is thread-safe
    • can be extended
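
The ordering problem mentioned for map-based matchers can be sketched with the standard library alone. This is not the code of any of the packages above. The names matcher, ordered, detect, and the deliberately naive isDocx check are assumptions made for illustration. The point is only that a slice is walked in a fixed order, so a specific check can be made to run before the generic zip check, whereas Go does not define the iteration order of a map.

```go
package main

import (
	"bytes"
	"fmt"
)

// matcher pairs a MIME type with a function that tests the raw bytes.
type matcher struct {
	mime  string
	match func([]byte) bool
}

// isZip reports whether data starts with the ZIP signature 50 4B 03 04.
func isZip(data []byte) bool {
	return bytes.HasPrefix(data, []byte("PK\x03\x04"))
}

// isDocx is a naive stand-in for a real docx check: a ZIP whose bytes
// mention the main Word part name. Used only to show ordering.
func isDocx(data []byte) bool {
	return isZip(data) && bytes.Contains(data, []byte("word/document.xml"))
}

// ordered lists matchers from most to least specific. Ranging over a
// slice follows index order, so docx is always tried before zip.
var ordered = []matcher{
	{"application/vnd.openxmlformats-officedocument.wordprocessingml.document", isDocx},
	{"application/zip", isZip},
}

// detect returns the MIME type of the first matcher that accepts data.
func detect(data []byte) string {
	for _, m := range ordered {
		if m.match(data) {
			return m.mime
		}
	}
	return "application/octet-stream"
}

func main() {
	fmt.Println(detect([]byte("PK\x03\x04....word/document.xml....")))
	fmt.Println(detect([]byte("PK\x03\x04plain archive"))) // application/zip
}
```

With the same two matchers stored in a map, either could be visited first, so a docx file could come back as application/zip on some runs.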
GabrielVasile answered Jan 19 '23 00:01


The formats ending in x are relatively easy to detect. Unzip the file and read _rels/.rels. It contains the path to the document's main part, identified by the relationship type http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument. Check that part's name: it is document.xml for docx, workbook.xml for xlsx, and presentation.xml for pptx.

More information can be found in ECMA-376.
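
The relationship lookup described above could be sketched as follows, using only archive/zip and encoding/xml. The function names detectOOXML and sampleOOXML, the relationships struct, and the short return labels are assumptions for this example. sampleOOXML builds a tiny in-memory archive purely so the sketch can be run, and its output is not a valid Office document.

```go
package main

import (
	"archive/zip"
	"bytes"
	"encoding/xml"
	"fmt"
	"path"
)

const officeDocRel = "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"

// relationships mirrors the parts of _rels/.rels needed here.
type relationships struct {
	Items []struct {
		Type   string `xml:"Type,attr"`
		Target string `xml:"Target,attr"`
	} `xml:"Relationship"`
}

// readRels parses the package relationships part, if the archive has one.
func readRels(zr *zip.Reader) (relationships, bool) {
	var rels relationships
	for _, f := range zr.File {
		if f.Name != "_rels/.rels" {
			continue
		}
		rc, err := f.Open()
		if err != nil {
			return rels, false
		}
		defer rc.Close()
		return rels, xml.NewDecoder(rc).Decode(&rels) == nil
	}
	return rels, false
}

// detectOOXML returns "docx", "xlsx", "pptx", or "" when unrecognized.
func detectOOXML(data []byte) string {
	zr, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
	if err != nil {
		return "" // not a readable ZIP archive
	}
	rels, ok := readRels(zr)
	if !ok {
		return ""
	}
	for _, r := range rels.Items {
		if r.Type != officeDocRel {
			continue
		}
		// The target is a path such as word/document.xml; compare its base name.
		switch path.Base(r.Target) {
		case "document.xml":
			return "docx"
		case "workbook.xml":
			return "xlsx"
		case "presentation.xml":
			return "pptx"
		}
	}
	return ""
}

// sampleOOXML (scaffolding) builds a ZIP holding only _rels/.rels.
// Errors are ignored because writes to a bytes.Buffer do not fail.
func sampleOOXML(target string) []byte {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	w, _ := zw.Create("_rels/.rels")
	fmt.Fprintf(w, `<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">`+
		`<Relationship Id="rId1" Type="%s" Target="%s"/></Relationships>`, officeDocRel, target)
	zw.Close()
	return buf.Bytes()
}

func main() {
	fmt.Println(detectOOXML(sampleOOXML("word/document.xml")))
	fmt.Println(detectOOXML(sampleOOXML("xl/workbook.xml")))
}
```

Unlike a signature check, this needs the whole file (or at least a seekable reader), because the ZIP directory sits at the end of the archive.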

The binary formats are harder to detect. Basically, you need to read the MS-CFB file system and check for these entries:

  • WordDocument for doc
  • Workbook or Book for xls
  • PowerPoint Document for ppt
  • EncryptedPackage means the file is encrypted
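
The entry names above can be turned into a small classifier. This sketch covers only two pieces: recognizing the 8-byte compound file signature (D0 CF 11 E0 A1 B1 1A E1) and mapping entry names to a label. It assumes the directory entry names have already been extracted by some MS-CFB reader, which is the harder part and is not shown. The names isCFB and classifyCFB and the returned labels are hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
)

// cfbMagic is the signature at the start of an MS-CFB compound file.
var cfbMagic = []byte{0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1}

// isCFB reports whether data starts with the compound file signature.
func isCFB(data []byte) bool {
	return bytes.HasPrefix(data, cfbMagic)
}

// classifyCFB maps directory entry names (assumed to be read elsewhere)
// to a short label: "doc", "xls", "ppt", "encrypted", or "".
func classifyCFB(entries []string) string {
	seen := make(map[string]bool, len(entries))
	for _, e := range entries {
		seen[e] = true
	}
	switch {
	case seen["WordDocument"]:
		return "doc"
	case seen["Workbook"] || seen["Book"]:
		return "xls"
	case seen["PowerPoint Document"]:
		return "ppt"
	case seen["EncryptedPackage"]:
		return "encrypted"
	}
	return ""
}

func main() {
	head := append(append([]byte{}, cfbMagic...), make([]byte, 504)...)
	fmt.Println(isCFB(head))                                        // true
	fmt.Println(classifyCFB([]string{"Root Entry", "WordDocument"})) // doc
}
```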
pzinovkin answered Jan 18 '23 23:01