
Is there an easy way to manually decode a FlateDecode Filter to extract text in a PDF? C#

I posted a question related to this a while back but got no responses. Since then, I've discovered that the PDF is encoded using FlateDecode, and I was wondering if there is a way to manually decode the PDF in C# (Windows Phone 8)? I'm getting output like the following:

%PDF-1.5
%????
1 0 obj
<<
/Type /Catalog
/Pages 2 0 R
>>
endobj
5 0 obj
<<
/Filter /FlateDecode
/Length 9
>>
stream x^+

The PDF has been created using the SyncFusion PDF controls for Windows Phone 8. Unfortunately, they do not currently have a text extraction feature, and I couldn't find that feature in other WP PDF controls either.

Basically, all I want is to download the PDF from OneDrive and read its contents. Is this easily doable?

greentea asked Sep 11 '14 23:09



2 Answers

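using System;
using System.IO;
using System.IO.Compression;
using System.Text;

// Inflates the raw bytes of a PDF FlateDecode stream and returns them as text.
private static string decompress(byte[] input)
{
    // FlateDecode data is zlib-wrapped; DeflateStream expects raw deflate
    // data, so skip the 2-byte zlib header.
    byte[] cutinput = new byte[input.Length - 2];
    Array.Copy(input, 2, cutinput, 0, cutinput.Length);

    using (var stream = new MemoryStream())
    using (var compressStream = new MemoryStream(cutinput))
    using (var decompressor = new DeflateStream(compressStream, CompressionMode.Decompress))
    {
        decompressor.CopyTo(stream);
        return Encoding.Default.GetString(stream.ToArray());
    }
}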

According to the similar question linked below, the first 2 bytes of the stream have to be cut off before decompression. They are the zlib header, which DeflateStream does not expect. The function above does this. Just pass all the bytes of the stream as input, and make sure the byte count matches the /Length value given in the stream dictionary.

C# decode (decompress) Deflate data of PDF File
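
To get the bytes to pass in, the stream body has to be cut out of the file first. The following is a minimal sketch, not a full PDF parser: PdfStreamSketch and ExtractStreamBytes are hypothetical names, and it assumes the /Length value has already been read from the dictionary as a direct number (it can also be an indirect reference, which this sketch does not resolve).

```csharp
using System;
using System.Text;

public static class PdfStreamSketch
{
    // Returns the raw (still compressed) bytes of the first stream found at or
    // after startIndex, given the /Length value from that stream's dictionary.
    public static byte[] ExtractStreamBytes(byte[] pdf, int length, int startIndex = 0)
    {
        byte[] keyword = Encoding.ASCII.GetBytes("stream");
        int pos = IndexOf(pdf, keyword, startIndex);
        if (pos < 0)
            throw new InvalidOperationException("No stream keyword found.");

        pos += keyword.Length;
        // The keyword is followed by CRLF or a lone LF before the data begins.
        if (pos < pdf.Length && pdf[pos] == (byte)'\r') pos++;
        if (pos < pdf.Length && pdf[pos] == (byte)'\n') pos++;

        if (pos + length > pdf.Length)
            throw new InvalidOperationException("/Length runs past the end of the file.");

        byte[] data = new byte[length];
        Array.Copy(pdf, pos, data, 0, length);
        return data;
    }

    // Naive byte-pattern search; returns -1 when the pattern is absent.
    private static int IndexOf(byte[] haystack, byte[] needle, int start)
    {
        for (int i = start; i <= haystack.Length - needle.Length; i++)
        {
            int j = 0;
            while (j < needle.Length && haystack[i + j] == needle[j]) j++;
            if (j == needle.Length) return i;
        }
        return -1;
    }
}
```

The returned array can then be handed to decompress. Because the search also matches the tail of "endstream", startIndex should point at or before the object whose stream is wanted.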

Pete answered Oct 19 '22 05:10



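The easiest solution is to use the DeflateStream class provided by the .NET Framework. An example can be found in a similar thread. This approach has some pitfalls; for instance, DeflateStream expects raw deflate data, while a FlateDecode stream begins with a 2-byte zlib header.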
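
One way to guard against that pitfall is to check for the zlib header before skipping anything. This is a rough sketch under stated assumptions: ZlibSketch, HasZlibHeader and Inflate are hypothetical names, and it assumes System.IO.Compression.DeflateStream is available on the target platform. The header test follows the zlib format (RFC 1950): compression method 8 and a two-byte value divisible by 31.

```csharp
using System;
using System.IO;
using System.IO.Compression;

public static class ZlibSketch
{
    // True when the first two bytes form a valid zlib header (RFC 1950):
    // compression method 8 (deflate) and a header check value divisible by 31.
    public static bool HasZlibHeader(byte[] data)
    {
        if (data == null || data.Length < 2) return false;
        int cmf = data[0];
        int flg = data[1];
        return (cmf & 0x0F) == 8 && ((cmf << 8) | flg) % 31 == 0;
    }

    // Inflates the data, skipping the 2-byte zlib header only when one is present.
    public static byte[] Inflate(byte[] data)
    {
        int offset = HasZlibHeader(data) ? 2 : 0;
        using (var input = new MemoryStream(data, offset, data.Length - offset))
        using (var inflater = new DeflateStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            inflater.CopyTo(output);
            return output.ToArray();
        }
    }
}
```

The "x^" visible at the start of the stream in the question is the byte pair 0x78 0x5E, which passes this check. The sketch ignores the Adler-32 checksum that trails the data, so it does not detect corruption.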

If this doesn't work, there are libraries (like DotNetZip) capable of decompressing deflate streams. Please check this link for a performance comparison.

The last option I see, short of reinventing the wheel, is to use another PDF parsing library, either just for the stream decompression or for the whole PDF processing.

Gotcha answered Oct 19 '22 03:10
