Is it possible to get structural elements from a PDF file using iTextSharp?

We are using iTextSharp in a C# WinForms application to parse PDF files. With iTextSharp I can easily extract the text from a PDF. However, when a page contains an image between two lines of text, I cannot extract any information about the image.

My requirements are:

Get the structural elements of the PDF file
Determine whether each element is text, an image, a table, or something else

For example, the structural elements might look like this:

Text: paragraph1
Text: paragraph2
Image: image
Text: paragraph3
Table: table info
Text: paragraph4

If I can obtain the information in a format like this, I can easily identify the text, image, table, header, and footer content.

So, is it possible to get this kind of information using iTextSharp? If so, how? If not, could you suggest other tools that can meet this requirement?

Thanks to all,

Saravanan

asked Feb 16 '12 07:02

Saravanan

1 Answer

I had a similar need a while ago. I used this function (from "Extract images using iTextSharp"):

// Requires: using iTextSharp.text.pdf;
// Returns the first image XObject reachable from the given page (or form)
// dictionary, or null if none is found.
private static PdfObject FindImageInPDFDictionary(PdfDictionary pg)
{
    PdfDictionary res =
        (PdfDictionary)PdfReader.GetPdfObject(pg.Get(PdfName.RESOURCES));
    if (res == null)
    {
        // No /Resources entry, so there is nothing to search.
        return null;
    }

    PdfDictionary xobj =
        (PdfDictionary)PdfReader.GetPdfObject(res.Get(PdfName.XOBJECT));
    if (xobj != null)
    {
        foreach (PdfName name in xobj.Keys)
        {
            PdfObject obj = xobj.Get(name);
            if (obj.IsIndirect())
            {
                PdfDictionary tg = (PdfDictionary)PdfReader.GetPdfObject(obj);
                PdfName type =
                    (PdfName)PdfReader.GetPdfObject(tg.Get(PdfName.SUBTYPE));

                // Image at the root of the PDF
                if (PdfName.IMAGE.Equals(type))
                {
                    return obj;
                }
                // Image inside a form or a group: search it recursively,
                // and keep scanning the other entries if nothing is found.
                else if (PdfName.FORM.Equals(type) || PdfName.GROUP.Equals(type))
                {
                    PdfObject nested = FindImageInPDFDictionary(tg);
                    if (nested != null)
                    {
                        return nested;
                    }
                }
            }
        }
    }

    return null;
}

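As the foreach (PdfName name in xobj.Keys) loop in FindImageInPDFDictionary shows, you can walk every XObject in a page's resources, so I think you can parse a whole PDF this way and handle each kind of data you find. I'm not sure about the "verticality" part of your need, though, that is, recovering the order in which the elements appear on the page.

Hope this helps.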
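For the ordering part, one possible approach (a sketch, not something I have tested against your files) is to use the content-stream parser in iTextSharp 5.x, which reports text and images through IRenderListener together with their positions. The PageElement, ElementCollector and StructureDump names below are hypothetical helpers made up for this example; only PdfReader, PdfReaderContentParser, IRenderListener, TextRenderInfo, ImageRenderInfo and Vector come from iTextSharp. It assumes a simple single-column layout, sorting purely by vertical position.

```csharp
using System;
using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical record for one piece of page content (not an iTextSharp type).
public class PageElement
{
    public string Kind;     // "Text" or "Image"
    public string Content;  // the text itself, or a placeholder for images
    public float X;
    public float Y;         // PDF user space: a larger Y is higher on the page
}

// Hypothetical listener that records every text chunk and image the parser reports.
public class ElementCollector : IRenderListener
{
    public readonly List<PageElement> Elements = new List<PageElement>();

    public void BeginTextBlock() { }
    public void EndTextBlock() { }

    public void RenderText(TextRenderInfo info)
    {
        // Start of the baseline of this text chunk.
        Vector start = info.GetBaseline().GetStartPoint();
        Elements.Add(new PageElement
        {
            Kind = "Text",
            Content = info.GetText(),
            X = start[Vector.I1],
            Y = start[Vector.I2]
        });
    }

    public void RenderImage(ImageRenderInfo info)
    {
        // Start point of the image in user space.
        Vector start = info.GetStartPoint();
        Elements.Add(new PageElement
        {
            Kind = "Image",
            Content = "Image",
            X = start[Vector.I1],
            Y = start[Vector.I2]
        });
    }

    // Returns a copy of the elements sorted top to bottom, then left to right.
    public List<PageElement> InReadingOrder()
    {
        List<PageElement> sorted = new List<PageElement>(Elements);
        sorted.Sort(delegate(PageElement a, PageElement b)
        {
            int byY = b.Y.CompareTo(a.Y);
            return byY != 0 ? byY : a.X.CompareTo(b.X);
        });
        return sorted;
    }
}

public static class StructureDump
{
    public static void Main(string[] args)
    {
        PdfReader reader = new PdfReader(args[0]); // path to the PDF file
        try
        {
            PdfReaderContentParser parser = new PdfReaderContentParser(reader);
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                ElementCollector collector = new ElementCollector();
                parser.ProcessContent(page, collector);
                foreach (PageElement e in collector.InReadingOrder())
                {
                    Console.WriteLine("{0}: {1}", e.Kind, e.Content);
                }
            }
        }
        finally
        {
            reader.Close();
        }
    }
}
```

Note that RenderText is called once per text-showing operation, so the "Text" entries are chunks rather than whole paragraphs; you would still have to group them yourself. As far as I know, the content stream has no table object either: a table is just drawn lines plus text, so detecting one would take your own heuristics, unless the PDF is tagged and carries a structure tree.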

answered Sep 30 '22 16:09

cubitouch

Tags: c#, pdf, c#-4.0, itextsharp
