I am using iTextSharp in a C# WinForms application to parse a PDF file. With iTextSharp, I can easily extract the text from the PDF. However, when the PDF contains an image between two lines of text, I cannot extract any information about the image.
My requirement is to obtain the structural elements of the PDF in a form similar to the following:
text :paragraph1
text :paragraph2
Image:Image
text :paragraph3
Table:table info
text :Paragraph4
If I can obtain information in a format like this, I can easily understand the text, image, table, header or footer information.
So, is it possible to get this kind of information using iTextSharp? If yes, please enlighten me on this. Otherwise, could you please suggest some other tools capable of meeting this requirement?
Thanks to all,
Saravanan
iTextSharp is an advanced library used for creating complex PDF reports. iText is used across different technologies: Android, .NET, Java and GAE developers use it to enhance their applications with PDF functionality.
A Chunk is the smallest significant piece of text that you can work with. Its ASP.NET equivalent is the <asp:Label>. As with the Label, you need to be careful how you use Chunks. The following snippet shows how to set the text of a Chunk, then write it to the PDF document 3 times:
string path = Server.MapPath("PDFs");
I had a similar need a while ago and used this function (from "Extract images using iTextSharp"):
// Requires: using iTextSharp.text.pdf;
// Returns the first image XObject found in the given page (or form) dictionary, or null.
private static PdfObject FindImageInPDFDictionary(PdfDictionary pg)
{
    PdfDictionary res =
        (PdfDictionary)PdfReader.GetPdfObject(pg.Get(PdfName.RESOURCES));
    if (res == null)
    {
        return null; // no resources, so nothing to search
    }
    PdfDictionary xobj =
        (PdfDictionary)PdfReader.GetPdfObject(res.Get(PdfName.XOBJECT));
    if (xobj != null)
    {
        foreach (PdfName name in xobj.Keys)
        {
            PdfObject obj = xobj.Get(name);
            if (obj.IsIndirect())
            {
                PdfDictionary tg = (PdfDictionary)PdfReader.GetPdfObject(obj);
                PdfName type =
                    (PdfName)PdfReader.GetPdfObject(tg.Get(PdfName.SUBTYPE));
                // image at the root of the pdf
                if (PdfName.IMAGE.Equals(type))
                {
                    return obj;
                }
                // image inside a form or a group: search it recursively,
                // and keep scanning the remaining XObjects if it has none
                else if (PdfName.FORM.Equals(type) || PdfName.GROUP.Equals(type))
                {
                    PdfObject nested = FindImageInPDFDictionary(tg);
                    if (nested != null)
                    {
                        return nested;
                    }
                }
            }
        }
    }
    return null;
}
As you can see from the foreach (PdfName name in xobj.Keys) statement, I think you can adapt this to walk through a whole PDF and handle every kind of data in it. But I'm not sure about the "verticality" part of your need.
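For the vertical ordering, here is a rough sketch of one possible approach, assuming iTextSharp 5.x and its iTextSharp.text.pdf.parser namespace. The class name OrderedContentListener and the file name input.pdf are made up for illustration. The listener records the y position of every text run and image on a page, so they can be printed top to bottom in the "text :" / "Image:" format from the question. Note that text arrives as individual runs rather than paragraphs, and tables are not marked as such in the content stream, so this does not cover the "Table:" part.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper: collects text runs and images with their y coordinate.
class OrderedContentListener : IRenderListener
{
    public readonly List<KeyValuePair<float, string>> Items =
        new List<KeyValuePair<float, string>>();

    public void BeginTextBlock() { }
    public void EndTextBlock() { }

    public void RenderText(TextRenderInfo info)
    {
        // y coordinate of the start of the text baseline
        float y = info.GetBaseline().GetStartPoint()[Vector.I2];
        Items.Add(new KeyValuePair<float, string>(y, "text :" + info.GetText()));
    }

    public void RenderImage(ImageRenderInfo info)
    {
        // vertical translation of the image matrix
        // (the bottom edge of the image when it is not rotated)
        float y = info.GetImageCTM()[Matrix.I32];
        Items.Add(new KeyValuePair<float, string>(y, "Image:Image"));
    }
}

class Program
{
    static void Main()
    {
        PdfReader reader = new PdfReader("input.pdf"); // assumed file name
        try
        {
            PdfReaderContentParser parser = new PdfReaderContentParser(reader);
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                OrderedContentListener listener =
                    parser.ProcessContent(page, new OrderedContentListener());
                // PDF y coordinates grow upwards, so a larger y is higher on the page
                foreach (var item in listener.Items.OrderByDescending(i => i.Key))
                {
                    Console.WriteLine(item.Value);
                }
            }
        }
        finally
        {
            reader.Close();
        }
    }
}
```

Sorting on y alone is a simplification: multi-column layouts or rotated content would need the x coordinate and the full matrix as well.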
Hope this helps.