
Parsing simple MIME files from C/C++?

Tags: c++, parsing, mime

I have searched the web for days now but I can't seem to find a good solution to my problem:

For one of my projects I'm looking for a good (lightweight) MIME parser. My customer provides MIME formatted files (linear, no hierarchy) which contain 3-4 "parts". The application must be able to split those parts and process them independently.

Basically those MIME files are like raw e-mail messages, but without the SMTP headers. Instead they begin with the MIME header "MIME-Version: 1.0", and the parts follow after that.

I am using C++ for the application, so a C++ library is welcome. A standard C library is welcome, too; but it should fit the following criteria:

  • Be open (at least LGPL), not proprietary
  • Compact - I just need the parser, no SMTP/POP3 support
  • Cross-Platform (targeting Windows, Mac OS X and Linux)

After days of searching I found the following libraries, along with reasons not to use them:

  • mimetic (C++) --- Although this library seems complete and for C++ usage, it is based on glib, which won't properly compile on Windows.
  • Vmime (C++) --- Seems complete, but there is no official Windows support. Also they provide "dual licensing" ("commercial LGPL" + GPL). Seems to be included with Ubuntu and Debian, but the licensing is confusing.
  • mime++ --- Commercial, no Mac support.
  • Chilkat Software MIME C++ Library --- Commercial and focused on Windows.

I don't really want to write my own MIME parser. MIME is so widespread that there must be some open library to handle this file format in a sane way.

So, do you guys have any ideas, suggestions or links?

Thanks in advance!

BastiBen asked Jun 14 '10 14:06



1 Answer

GMime is an LGPL mime parser written in C. It does depend on glib, but glib is available on Windows: 32bit and 64bit (and all Unix-based platforms, including Mac OS X). It also builds inside Visual Studio afaict, so I fail to see what the problem is. I know there is at least 1 commercial Windows vendor shipping libgmime.dll and libglib.dll in their product (Kerio Connect, iirc). Nokia even ships it on some of their phones.

There is really no such thing as a "lightweight" MIME parser if you expect it to do anything more than split headers on ':', do haphazard parsing of the Content-Type header to look for a boundary string, and then handle non-nested multiparts. That is kinda useless outside of parsing HTTP responses and pre-canned MIME messages whose composition you control.
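To make the point concrete, here is a rough sketch of what such a "lightweight" splitter tends to look like. The function names `find_boundary` and `split_parts` are made up for this illustration and do not come from any library. It assumes a flat (non-nested) multipart whose delimiter lines carry no trailing whitespace, and it ignores header folding, comments, RFC 2231 parameter continuations, transfer encodings and charsets, which is exactly the work a real parser has to do.

```cpp
#include <algorithm>
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: find the boundary parameter by a case-insensitive
// substring search. Naive on purpose: it would also match "boundary="
// appearing anywhere else in the text it is given.
std::string find_boundary(const std::string& header_block) {
    std::string lower(header_block);
    std::transform(lower.begin(), lower.end(), lower.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    const std::string key = "boundary=";
    std::size_t pos = lower.find(key);
    if (pos == std::string::npos) return "";
    pos += key.size();
    if (pos < header_block.size() && header_block[pos] == '"') {
        // Quoted value: take everything up to the closing quote.
        const std::size_t end = header_block.find('"', pos + 1);
        if (end == std::string::npos) return "";
        return header_block.substr(pos + 1, end - pos - 1);
    }
    // Unquoted value: stop at the next separator or line end.
    const std::size_t end = header_block.find_first_of("; \t\r\n", pos);
    return header_block.substr(pos, end == std::string::npos ? end : end - pos);
}

// Hypothetical helper: split a flat multipart body into its raw parts
// (part headers + blank line + part body), using a readline-style loop.
std::vector<std::string> split_parts(const std::string& text,
                                     const std::string& boundary) {
    const std::string delim = "--" + boundary;
    const std::string closing = delim + "--";
    std::vector<std::string> parts;
    std::string current;
    bool in_part = false;

    // The line break before a delimiter belongs to the delimiter, so drop it.
    auto flush = [&]() {
        if (!current.empty() && current.back() == '\n') current.pop_back();
        parts.push_back(current);
        current.clear();
    };

    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r') line.pop_back();  // CRLF input
        if (line == closing) {
            if (in_part) flush();
            break;
        }
        if (line == delim) {
            if (in_part) flush();
            in_part = true;
            continue;
        }
        if (in_part) {
            current += line;
            current += '\n';  // line endings are normalised to LF here
        }
    }
    return parts;
}
```

Calling `split_parts(message, find_boundary(message))` on a whole file gives you the 3-4 raw parts, each still carrying its own headers. Anything beyond that (nested multiparts, encoded headers, base64 or quoted-printable bodies) is not handled at all.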

The reason that parsers like GMime are so "large", as far as lines of code go, is that they are meant for developers who actually want correct and robust MIME part and header parsing/decoding. See my rant about decoding rfc2047 encoded-word tokens for an idea about how complex this can get (btw, other than GMime and MimeKit, I have yet to find any open source MIME parsers capable of handling all of the edge cases discussed in my rant).

Even with all of this extra robust processing, it's still as fast or faster than most "lightweight" mime parsers are likely to be, especially considering most of them use a readline approach. I've seen "lightweight" mime parsers purport to parse 25MB email files in 2-3 seconds and consider that to be "fast". My unit tests for GMime parse 2 mbox files full of messages larger than 1.2GB (yes, gigabytes) in less time than that.

My point is that "lightweight" is a bullshit criterion used by people who don't know what they are talking about.

How about judging based on something meaningful such as rfc compliance? Or by a combination of rfc compliance and performance? Either way, GMime will come out a winner in any meaningful comparison you make.

jstedfast answered Nov 16 '22 00:11
