 

Is it possible to use libtiff to decode CCITT-encoded data when the length is not known?


In the answers to the question "c++ decode CCITT encoded images in pdfs", it is pointed out that libtiff can be used to decode CCITT-encoded images. Of course, we must prepend a TIFF header to turn the CCITT stream into a valid TIFF file.

However, some images in PDF files are inline images and their lengths are not given, although their width, height, and bit depth are given. The program reading the PDF is expected to decode the CCITT stream, read (width * height * depth) bits of decoded data, and wherever it is after the data have been read, that's the end of the inline image. Then it should go on to the next page marking command, and so on.

This poses a problem. A TIFF image file directory must specify how many bytes there are in each strip of the image data. We won't know how many bytes of the encoded data actually belong to the image until we've decoded it, yet we can't decode the image without using libtiff...

Is there a way to use libtiff here or do we need custom CCITT filter code?

Brian Bi asked Oct 08 '16 01:10



1 Answer

Strictly speaking, the answer to "Is it possible to use libtiff...?" is yes. It involves some hacking, but not too much.

Fact: the data consists of a single strip. There isn't any offset information, so our only strip offset is zero. We just need to read that strip in.

Fact: this data is the compression of a W*H 1-bit deep pixel matrix.

Step 1: estimate the maximum possible length of the compressed stream. This comes out at around 15% of W*H; for example, W=1000 and H=1000 gives 150000 bytes. The estimate is meant to be larger than the actual length. If we have a better estimate thanks to having located the proper EI (end image) operator, that's even better, but it is not necessary.
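As a minimal sketch of Step 1, the rule of thumb can be written as a small helper. The function name `estimate_ccitt_len` is made up for illustration, and the 15% factor is simply the figure quoted above, not a bound taken from the CCITT specifications.

```c
#include <stddef.h>
#include <stdint.h>

/* Rough over-estimate of the compressed length in bytes, using the
 * 15%-of-W*H rule of thumb from the text. Hypothetical helper name. */
size_t estimate_ccitt_len(uint32_t width, uint32_t height)
{
    uint64_t pixels = (uint64_t)width * height;  /* 64-bit to avoid overflow */
    return (size_t)((pixels * 15 + 99) / 100);   /* 15%, rounded up */
}
```

With W=1000 and H=1000 this returns 150000, matching the figure above.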

Step 2: build a "virtual" TIFF file. This is made up of a header of the form 49 49 2a 00 AA BB CC DD, where 0xDDCCBBAA (stored little-endian, as the 49 49 byte-order mark requires) is the estimated length plus 8; followed by our estimated data stream; followed by a TIFF directory.

Step 3: the TIFF directory will always have the same structure; some values in it are offsets and depend trivially on the IFD position 0xDDCCBBAA. Quoting from the TIFF 6.0 specification (note that the sample uses the opposite byte order to our header: Motorola "MM", big-endian, rather than Intel "II", little-endian):

TIFF 6.0 Specification, Final (June 3, 1992), page 20

Putting it all together (along with a couple of less-important fields that are discussed
later), a sample bilevel image file might contain the following fields

A Sample Bilevel TIFF File

Offset Description Value
(hex) (numeric values are expressed in hexadecimal notation)
Header:
0000 Byte Order     4D4D 
0002 42             002A
0004 1st IFD offset 00000014
IFD:
0014 Number of Directory Entries 000C
0016 NewSubfileType              00FE 0004 00000001 00000000
0022 ImageWidth                  0100 0004 00000001 000007D0
002E ImageLength                 0101 0004 00000001 00000BB8
003A Compression                 0103 0003 00000001 8005 0000
0046 PhotometricInterpretation   0106 0003 00000001 0001 0000
0052 StripOffsets                0111 0004 000000BC 000000B6(*1)
005E RowsPerStrip                0116 0004 00000001 00000010
006A StripByteCounts             0117 0003 000000BC 000003A6(*2)
0076 XResolution                 011A 0005 00000001 00000696(*3)
0082 YResolution                 011B 0005 00000001 0000069E(*4)
008E Software                    0131 0002 0000000E 000006A6(*5)
009A DateTime                    0132 0002 00000014 000006B6(*6)
00A6 Next IFD offset             00000000
Values longer than 4 bytes:
(*1) StripOffsets Offset0        00000008
(*2) StripByteCounts Count0
(*3) XResolution 0000012C 00000001
(*4) YResolution 0000012C 00000001
(*5) Software “PageMaker 4.0”
(*6) DateTime “1988:02:18 13:59:59”

In the sample above, the first IFD offset (our 0xDDCCBBAA) is 0x0014, and all the other offsets follow from it.
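Steps 2 and 3 could be sketched roughly as follows. This is only an illustration under several assumptions: the function and helper names are invented, the set of nine tags is my own minimal choice for a 1-bit single-strip image (it is not the tag list from the spec sample), every value is stored inline so no out-of-line data is needed, and the IFD offset is rounded up to an even number because the spec wants the IFD on a word boundary.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define N_ENTRIES 9  /* number of IFD entries written below */

/* Little-endian writers, matching the "II" byte-order mark. */
static void put16(uint8_t *p, uint16_t v) { p[0] = v & 0xFF; p[1] = (v >> 8) & 0xFF; }
static void put32(uint8_t *p, uint32_t v)
{
    p[0] = v & 0xFF;         p[1] = (v >> 8) & 0xFF;
    p[2] = (v >> 16) & 0xFF; p[3] = (v >> 24) & 0xFF;
}

/* Write one 12-byte IFD entry with count 1 and an inline value.
 * type 3 = SHORT, type 4 = LONG. Returns the position after the entry. */
static uint8_t *put_entry(uint8_t *p, uint16_t tag, uint16_t type, uint32_t value)
{
    put16(p, tag);
    put16(p + 2, type);
    put32(p + 4, 1);
    if (type == 3) { put16(p + 8, (uint16_t)value); put16(p + 10, 0); }
    else           { put32(p + 8, value); }
    return p + 12;
}

/* Lay out header + est_len bytes of stream + IFD in one malloc'ed buffer.
 * 'avail' is how many stream bytes we really have; any shortfall is left
 * as zero padding. compression: 3 = CCITT Group 3, 4 = CCITT Group 4.
 * Caller frees the result. No overflow checks: this is a sketch. */
uint8_t *build_virtual_tiff(const uint8_t *ccitt, size_t avail, uint32_t est_len,
                            uint16_t compression, uint32_t width, uint32_t height,
                            size_t *out_size)
{
    uint32_t ifd_off = 8 + est_len;
    ifd_off += ifd_off & 1;                      /* keep the IFD word-aligned */
    size_t total = (size_t)ifd_off + 2 + 12 * N_ENTRIES + 4;

    uint8_t *buf = calloc(1, total);
    if (!buf) return NULL;

    memcpy(buf, "II*\0", 4);                     /* 49 49 2a 00 */
    put32(buf + 4, ifd_off);                     /* AA BB CC DD */
    memcpy(buf + 8, ccitt, avail < est_len ? avail : est_len);

    uint8_t *p = buf + ifd_off;
    put16(p, N_ENTRIES); p += 2;
    p = put_entry(p, 256, 4, width);             /* ImageWidth */
    p = put_entry(p, 257, 4, height);            /* ImageLength */
    p = put_entry(p, 258, 3, 1);                 /* BitsPerSample */
    p = put_entry(p, 259, 3, compression);       /* Compression */
    p = put_entry(p, 262, 3, 0);                 /* Photometric: WhiteIsZero */
    p = put_entry(p, 273, 4, 8);                 /* StripOffsets: right after header */
    p = put_entry(p, 277, 3, 1);                 /* SamplesPerPixel */
    p = put_entry(p, 278, 4, height);            /* RowsPerStrip: one strip */
    p = put_entry(p, 279, 4, est_len);           /* StripByteCounts: the estimate */
    put32(p, 0);                                 /* no next IFD */

    *out_size = total;
    return buf;
}
```

Whether PhotometricInterpretation should be 0 or 1 depends on the BlackIs1 setting of the PDF filter, so treat that value as a placeholder.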

I have done some tests using a single-strip TIFF G4 image that I generated with ImageMagick and converted with tiffcp to 1-strip CCITT format. The header there is slightly different (I don't see the Software and DateTime tags that the spec says should be there). Otherwise it checks out.

We now have a damaged TIFF image with one overlong strip, and it is in memory.

Using TIFFClientOpen, we can access it as if it were a file on disk.

Attempting to read the first strip will now result in an error, and the program aborts:

TIFFFillStrip: Read error on strip 0; got 143151 bytes, expected 762826.

By installing our own handler with TIFFSetErrorHandler and TIFFSetErrorHandlerExt, we can intercept this error and parse it instead of aborting, thereby recovering the 143151 figure.
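A handler along these lines could do the parsing. It is a sketch that deliberately does not include tiffio.h: `capture_read_error` has the same shape as libtiff's error handler type (module, format, va_list), and `simulate_error` is a hypothetical stand-in for libtiff calling it. The names and globals are mine, and the exact wording of the message may differ between libtiff versions, so the parser only looks for the "got ... bytes, expected ..." part.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Set by the handler when a short-read message has been recognised. */
unsigned long long g_got, g_expected;
int g_have_lengths;

/* Same shape as a libtiff error handler: format the message, then try to
 * pull the two byte counts out of it instead of printing and aborting. */
void capture_read_error(const char *module, const char *fmt, va_list ap)
{
    char msg[512];
    unsigned long long got, expected;
    const char *p;

    (void)module;                            /* not needed for parsing */
    vsnprintf(msg, sizeof msg, fmt, ap);

    p = strstr(msg, "got ");
    if (p && sscanf(p, "got %llu bytes, expected %llu", &got, &expected) == 2) {
        g_got = got;
        g_expected = expected;
        g_have_lengths = 1;
    }
}

/* Hypothetical stand-in for libtiff invoking the handler. */
void simulate_error(const char *module, const char *fmt, ...)
{
    va_list ap;
    va_start(ap, fmt);
    capture_read_error(module, fmt, ap);
    va_end(ap);
}
```

In a real program the handler would be registered with TIFFSetErrorHandler(capture_read_error) before the strip is read, and the globals checked afterwards.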

We need to supply the callbacks to TIFFClientOpen, but they're all very easy:

TIFFReadWriteProc readproc(h, *ptr, n) // copy n bytes from FakeBuffer+pos into ptr, update pos to pos + n, ignore h.
TIFFReadWriteProc writeproc            // Throw an error. We don't write
TIFFSeekProc seekproc                  // update pos appropriately
TIFFCloseProc closeproc                // do nothing
TIFFSizeProc sizeproc                  // return total buffer size
TIFFMapFileProc mapproc                // Set to NULL
TIFFUnmapFileProc unmapproc            // Set to NULL
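The callbacks might look something like this. To keep the sketch compilable without libtiff, it uses plain standard types as stand-ins (void * for thandle_t, int64_t for tmsize_t, uint64_t for toff_t); real code would include tiffio.h and use libtiff's own types. Unlike the description above, which ignores h and uses a global buffer, this version passes the buffer through the handle. The `MemFile` struct and the `mem_*` names are invented for illustration.

```c
#include <stdint.h>
#include <stdio.h>    /* SEEK_SET, SEEK_CUR, SEEK_END */
#include <string.h>

/* In-memory "file": the virtual TIFF plus a read position. */
typedef struct {
    const uint8_t *data;
    uint64_t size;
    uint64_t pos;
} MemFile;

/* readproc: copy up to n bytes from data+pos into ptr and advance pos. */
int64_t mem_read(void *h, void *ptr, int64_t n)
{
    MemFile *m = h;
    uint64_t left;
    if (n <= 0 || m->pos >= m->size) return 0;
    left = m->size - m->pos;
    if ((uint64_t)n > left) n = (int64_t)left;   /* short read at end of buffer */
    memcpy(ptr, m->data + m->pos, (size_t)n);
    m->pos += (uint64_t)n;
    return n;
}

/* writeproc: we never write, so always report failure. */
int64_t mem_write(void *h, void *ptr, int64_t n)
{
    (void)h; (void)ptr; (void)n;
    return -1;
}

/* seekproc: update pos according to whence and return the new position. */
uint64_t mem_seek(void *h, uint64_t off, int whence)
{
    MemFile *m = h;
    switch (whence) {
    case SEEK_SET: m->pos = off;           break;
    case SEEK_CUR: m->pos += off;          break;
    case SEEK_END: m->pos = m->size + off; break;
    default:       return (uint64_t)-1;    /* unknown whence */
    }
    return m->pos;
}

/* closeproc: nothing to release, the caller owns the buffer. */
int mem_close(void *h) { (void)h; return 0; }

/* sizeproc: total size of the virtual file. */
uint64_t mem_size(void *h) { return ((MemFile *)h)->size; }
```

These would be handed to TIFFClientOpen together with a pointer to the MemFile as the client handle, with the map and unmap procedures left as NULL as noted above.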

The processing is indeed awkward and convoluted, but as for feasibility, it can be done.

I have run tests in C, extracting by hand the CCITT stream from an inline-image (BI/ID/EI) PDF I found online, and reading it as described above.

I've dredged up a message by Tilman Hausherr explaining a hack that recognizes valid PDF operators following the EI in order to identify the correct one, which makes me think there probably aren't many better methods. If I had a sure-fire way of identifying the correct EI, I could always estimate the correct offset and directly produce a correct and readable TIFF file from the PDF without involving libtiff at all.

LSerni answered Sep 25 '22 16:09
