
Scanning a JPEG file for markers

I have a C++ application that has a very straightforward requirement of extracting some meta-data from a JPEG file.

There are various libraries to do this, but when prototyping I simply wanted to get things done fast. Since I knew that a JPEG file is conveniently delineated by a series of markers (i.e. {0xFF, 0xXX} byte pairs, most of them followed by a length field), I thought it would be easy enough to start at the first marker and step from marker to marker until I hit the End-Of-Image marker.

This was easy to implement by reading the JPEG data into an std::vector<unsigned char> and then iterating over it, finding marker sections. I eventually abstracted this logic into a "marker-iterator" class that made it even easier to work with.

Generally this works great. In fact, the meta-data I'm interested in usually appears in the first segment after the SOI marker (i.e. the APP0 segment, beginning with { 0xFF, 0xE0 }). So, for the most part I don't even NEED to write logic to iterate over the whole JPEG file - I can just check the header, which I assumed always contains the APP0 marker.

Except then I discovered my assumption was wrong. Apparently, the { 0xFF, 0xE0 } marker doesn't ALWAYS have to be the first segment.

Okay, no problem - iterating over all the markers is easy enough anyway. Except, then I ran into another problem. For the most part, finding the next marker is as easy as adding a length field to the current index position into the JPEG data buffer. But apparently some length fields don't actually indicate the entire length of a particular segment. For example, the "Start-Of-Scan" segment in a JPEG file is followed by "entropy-coded data". The size of the "entropy-coded data" is not included in the length field.
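
For reference, the length-based stepping described above might look like the following sketch. The names skip_length_segment and kInvalidOffset are hypothetical and not taken from the original marker-iterator class. It assumes the segment at pos actually carries a length field, which is not true for SOI, EOI, or the restart markers.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sentinel meaning "could not step past this segment".
constexpr std::size_t kInvalidOffset = static_cast<std::size_t>(-1);

// Hypothetical helper: `pos` is the offset of a marker (0xFF, code) whose
// segment has a length field. Returns the offset just past that segment,
// or kInvalidOffset if the buffer is too short or the length is malformed.
std::size_t skip_length_segment(const std::vector<unsigned char>& data,
                                std::size_t pos)
{
    // Need 2 marker bytes plus a 2-byte big-endian length.
    if (data.size() < 4 || pos > data.size() - 4) return kInvalidOffset;
    if (data[pos] != 0xFF) return kInvalidOffset;

    // The length counts its own two bytes, but not the two marker bytes.
    const std::size_t len =
        (static_cast<std::size_t>(data[pos + 2]) << 8) | data[pos + 3];
    if (len < 2) return kInvalidOffset;

    const std::size_t next = pos + 2 + len;
    if (next > data.size()) return kInvalidOffset;
    return next;
}
```

Calling skip_length_segment(data, 2) on a buffer that starts with SOI would step over the first segment after it. For a Start-Of-Scan segment the returned offset only points past the scan header, i.e. at the first byte of the entropy-coded data, not at the next marker.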

So ... if you hit a "Start-Of-Scan" marker while iterating over a JPEG file, how do you know where the next marker begins? Do you simply have to do a linear search, byte-by-byte, to find the next 0xFF character?

Actually, that wouldn't work either, because the entropy-coded data itself may contain 0xFF characters. However, apparently it is required by the JPEG standard that any 0xFF byte that appears in the entropy-coded data must be followed by a 0x00 byte to differentiate it from an actual marker.

Okay, so that still doesn't give me any way to find the next marker after the "Start-Of-Scan" section without doing a brute force linear search. Is that the only possible way to do it (without complex parsing logic that is specific to the "Start-Of-Scan" section)?

Siler asked Nov 10 '22 04:11


1 Answer

So ... if you hit a "Start-Of-Scan" marker while iterating over a JPEG file, how do you know where the next marker begins? Do you simply have to do a linear search, byte-by-byte, to find the next 0xFF character?

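Within a scan, a 0xFF byte can only be followed by 0x00 (a stuffed byte, meaning the 0xFF is data) or by a restart marker code (0xD0 through 0xD7). Any other 0xFFxx sequence should be the start of the next block.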
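
The rule in this answer can be sketched roughly as follows. This is only an illustration: the function name find_marker_after_scan is hypothetical, and pos is assumed to be the offset of the first entropy-coded byte, i.e. just past the Start-Of-Scan header. It is still a linear pass over the scan data, but the check at each 0xFF is a single comparison on the following byte. The sketch also skips 0xFF fill bytes, which the JPEG standard allows in front of a marker.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: returns the offset of the 0xFF byte that starts the
// first real marker after the entropy-coded data beginning at `pos`, or
// data.size() if no such marker is found.
std::size_t find_marker_after_scan(const std::vector<unsigned char>& data,
                                   std::size_t pos)
{
    while (pos + 1 < data.size()) {
        if (data[pos] != 0xFF) {
            ++pos;               // ordinary entropy-coded byte
            continue;
        }

        const unsigned char next = data[pos + 1];
        if (next == 0x00) {
            pos += 2;            // stuffed byte: a literal 0xFF in the data
        } else if (next >= 0xD0 && next <= 0xD7) {
            pos += 2;            // RST0..RST7: restart marker inside the scan
        } else if (next == 0xFF) {
            pos += 1;            // fill byte: re-check from the next 0xFF
        } else {
            return pos;          // any other 0xFFxx ends the scan
        }
    }
    return data.size();
}
```

The returned offset can then be fed back into whatever length-based stepping is used for ordinary segments, since the marker found there is either End-Of-Image or the start of another segment.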

Also, a JPEG image does not have to have an APP0 marker.

user3344003 answered Nov 14 '22 23:11