
How to extract highlighted text from PDF using iTextSharp?

Tags: .net, pdf, itext

As per the following post: iTextSharp PDF Reading highlighed text (highlight annotations) using C#

this code:

for (int i = pageFrom; i <= pageTo; i++) {
    PdfDictionary page = reader.GetPageN(i);
    PdfArray annots = page.GetAsArray(iTextSharp.text.pdf.PdfName.ANNOTS);
    if (annots != null) {
        foreach (PdfObject annot in annots.ArrayList) {
            PdfDictionary annotation = (PdfDictionary)PdfReader.GetPdfObject(annot);
            PdfString contents = annotation.GetAsString(PdfName.CONTENTS);
            // now use the String value of contents
        }
    }
}

works for extracting PDF annotations. But why does the same code not work for highlights when PdfName.ANNOTS is replaced with PdfName.HIGHLIGHT, as in the following?

for (int i = pageFrom; i <= pageTo; i++) {
    PdfDictionary page = reader.GetPageN(i);
    PdfArray annots = page.GetAsArray(iTextSharp.text.pdf.PdfName.HIGHLIGHT);
    if (annots != null) {
        foreach (PdfObject annot in annots.ArrayList) {
            PdfDictionary annotation = (PdfDictionary)PdfReader.GetPdfObject(annot);
            PdfString contents = annotation.GetAsString(PdfName.CONTENTS);
            // now use the String value of contents
        }
    }
}

John Stevensons asked Mar 18 '23 01:03


2 Answers

Please take a look at table 30 in ISO-32000-1 (aka the PDF reference). It is entitled "Entries in a page object". Among these entries, you can find a key named Annots. Its value is:

(Optional) An array of annotation dictionaries that shall contain indirect references to all annotations associated with the page (see 12.5, "Annotations").

You will not find an entry with a key such as Highlight, hence it is only normal that the array that is returned is null when you have this line:

PdfArray annots = page.GetAsArray(iTextSharp.text.pdf.PdfName.HIGHLIGHT);

You need to get the annotations the way you already did:

PdfArray annots = page.GetAsArray(iTextSharp.text.pdf.PdfName.ANNOTS);

Now you need to loop over this array and look for annotations with Subtype equal to Highlight. This type of annotation is listed in table 169 of ISO-32000-1, entitled "Annotation types".

In other words, your assumption that a page dictionary contains entries with key Highlight was wrong and if you read the whole specification, you will also discover another false assumption you've been making. You are falsely assuming that the highlighted text is stored in the Contents entry of the annotations. This reveals a lack of understanding about the nature of annotations versus page content.

The text you are looking for is stored in the content stream of the page. The content stream of the page is independent of the page's annotations. Hence, to get the highlighted text, you need to get the coordinates stored in the Highlight annotation (stored in the QuadPoints array) and you need to use these coordinates to parse the text that is present in the page content at those coordinates.
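
As a rough illustration of the coordinate step, the sketch below turns a flat QuadPoints array into one axis-aligned box per quadrilateral. It assumes the numbers have already been read from the annotation's QuadPoints entry into a float[] (eight numbers, that is four x/y pairs, per highlighted region). QuadPointsHelper, Box, and ToBoxes are hypothetical names invented for this example; they are not part of iTextSharp. Taking the minimum and maximum of the four corners avoids relying on the order in which a PDF producer wrote the vertices.

```csharp
using System;
using System.Collections.Generic;

public static class QuadPointsHelper
{
    // Axis-aligned box in PDF user space:
    // (Llx, Lly) is the lower-left corner, (Urx, Ury) the upper-right corner.
    public struct Box
    {
        public float Llx, Lly, Urx, Ury;
    }

    // Converts a flat QuadPoints array (8 numbers per quadrilateral)
    // into one bounding box per quadrilateral.
    public static List<Box> ToBoxes(float[] quadPoints)
    {
        if (quadPoints == null)
            throw new ArgumentNullException("quadPoints");
        if (quadPoints.Length % 8 != 0)
            throw new ArgumentException("QuadPoints must hold 8 numbers per quadrilateral.");

        var boxes = new List<Box>();
        for (int q = 0; q < quadPoints.Length; q += 8)
        {
            float minX = float.MaxValue, minY = float.MaxValue;
            float maxX = float.MinValue, maxY = float.MinValue;

            // Walk the four (x, y) corners of this quadrilateral.
            for (int p = 0; p < 8; p += 2)
            {
                float x = quadPoints[q + p];
                float y = quadPoints[q + p + 1];
                minX = Math.Min(minX, x);
                maxX = Math.Max(maxX, x);
                minY = Math.Min(minY, y);
                maxY = Math.Max(maxY, y);
            }

            boxes.Add(new Box { Llx = minX, Lly = minY, Urx = maxX, Ury = maxY });
        }
        return boxes;
    }
}
```

Each resulting box could then serve as the region for a location-based text extraction pass over the page content, one pass per quadrilateral. For a highlight that spans several lines this is usually tighter than the annotation's single Rect, which covers the whole annotation.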

Bruno Lowagie answered Mar 20 '23 14:03


Here is a complete example of extracting highlighted text using iTextSharp:

using System;
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public void GetRectAnno()
{
    string appRootDir = new DirectoryInfo(Environment.CurrentDirectory).Parent.Parent.FullName;
    string filePath = appRootDir + "/PDFs/" + "anot.pdf";

    try
    {
        using (PdfReader reader = new PdfReader(filePath))
        {
            for (int i = 1; i <= reader.NumberOfPages; i++)
            {
                PdfDictionary page = reader.GetPageN(i);
                PdfArray annots = page.GetAsArray(PdfName.ANNOTS);
                if (annots == null)
                    continue; // this page has no annotations

                foreach (PdfObject annot in annots.ArrayList)
                {
                    // Resolve the (possibly indirect) annotation dictionary
                    PdfDictionary annotationDic = (PdfDictionary)PdfReader.GetPdfObject(annot);

                    // Process highlight annotations only. Equals is called on the
                    // constant so that a missing Subtype entry does not throw.
                    PdfName subType = annotationDic.GetAsName(PdfName.SUBTYPE);
                    if (!PdfName.HIGHLIGHT.Equals(subType))
                        continue;

                    // Report the rectangle and QuadPoints of the highlight
                    PdfArray coordinates = annotationDic.GetAsArray(PdfName.RECT);
                    Console.WriteLine("Highlight at Rectangle {0} with QuadPoints {1}",
                        coordinates, annotationDic.GetAsArray(PdfName.QUADPOINTS));

                    // Rect is the bounding box of the whole annotation, so a
                    // multi-line highlight may also pick up neighbouring text.
                    Rectangle rect = new Rectangle(
                        coordinates.GetAsNumber(0).FloatValue,
                        coordinates.GetAsNumber(1).FloatValue,
                        coordinates.GetAsNumber(2).FloatValue,
                        coordinates.GetAsNumber(3).FloatValue);

                    // Extract only the text that falls inside the rectangle
                    RenderFilter[] filter = { new RegionTextRenderFilter(rect) };
                    ITextExtractionStrategy strategy =
                        new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);

                    // Show the extracted text on the console
                    Console.WriteLine(PdfTextExtractor.GetTextFromPage(reader, i, strategy));
                }
            }
        }
    }
    catch (Exception ex)
    {
        // Report the failure instead of silently swallowing it
        Console.Error.WriteLine("Failed to read highlights: " + ex.Message);
    }
}

Hassan Nazeer answered Mar 20 '23 13:03