As per the following post: iTextSharp PDF Reading highlighed text (highlight annotations) using C#
this code:
for (int i = pageFrom; i <= pageTo; i++) {
    PdfDictionary page = reader.GetPageN(i);
    PdfArray annots = page.GetAsArray(iTextSharp.text.pdf.PdfName.ANNOTS);
    if (annots != null) {
        foreach (PdfObject annot in annots.ArrayList) {
            PdfDictionary annotation = (PdfDictionary)PdfReader.GetPdfObject(annot);
            PdfString contents = annotation.GetAsString(PdfName.CONTENTS);
            // now use the String value of contents
        }
    }
}
works for extracting PDF annotations. But why does the following, otherwise identical code not work for highlights (specifically, PdfName.HIGHLIGHT does not work)?
for (int i = pageFrom; i <= pageTo; i++) {
    PdfDictionary page = reader.GetPageN(i);
    PdfArray annots = page.GetAsArray(iTextSharp.text.pdf.PdfName.HIGHLIGHT);
    if (annots != null) {
        foreach (PdfObject annot in annots.ArrayList) {
            PdfDictionary annotation = (PdfDictionary)PdfReader.GetPdfObject(annot);
            PdfString contents = annotation.GetAsString(PdfName.CONTENTS);
            // now use the String value of contents
        }
    }
}
Please take a look at table 30 in ISO-32000-1 (aka the PDF reference). It is entitled "Entries in a page object". Among these entries, you can find a key named Annots. Its value is:
(Optional) An array of annotation dictionaries that shall contain indirect references to all annotations associated with the page (see 12.5, "Annotations").
You will not find an entry with a key such as Highlight, hence it is only normal that the array that is returned is null when you have this line:
PdfArray annots = page.GetAsArray(iTextSharp.text.pdf.PdfName.HIGHLIGHT);
You need to get the annotations the way you already did:
PdfArray annots = page.GetAsArray(iTextSharp.text.pdf.PdfName.ANNOTS);
Now you need to loop over this array and look for annotations with Subtype equal to Highlight. This type of annotation is listed in table 169 of ISO-32000-1, entitled "Annotation types".
In other words, your assumption that a page dictionary contains entries with key Highlight was wrong, and if you read the whole specification, you will also discover another false assumption you've been making. You are falsely assuming that the highlighted text is stored in the Contents entry of the annotations. This reveals a lack of understanding about the nature of annotations versus page content.
The text you are looking for is stored in the content stream of the page. The content stream of the page is independent of the page's annotations. Hence, to get the highlighted text, you need to get the coordinates stored in the Highlight annotation (stored in the QuadPoints array) and you need to use these coordinates to parse the text that is present in the page content at those coordinates.
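To make the coordinate step a bit more concrete: QuadPoints is a flat array of 8 × n numbers, describing n quadrilaterals (typically one per highlighted line). The following is only a minimal sketch, not iTextSharp API: Box and QuadPointsHelper are hypothetical names introduced for illustration, and the input is assumed to be the QuadPoints values already read into a float array. It computes one bounding box per quadrilateral, which sidesteps the fact that PDF writers do not all order the four corner points the same way.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper type: an axis-aligned box in PDF user space.
public struct Box
{
    public float Llx, Lly, Urx, Ury;

    public Box(float llx, float lly, float urx, float ury)
    {
        Llx = llx; Lly = lly; Urx = urx; Ury = ury;
    }
}

// Hypothetical helper class, not part of iTextSharp.
public static class QuadPointsHelper
{
    // Splits a flat QuadPoints array (8 numbers per quadrilateral)
    // into one bounding box per quadrilateral.
    public static List<Box> ToBoxes(float[] quadPoints)
    {
        if (quadPoints == null || quadPoints.Length % 8 != 0)
            throw new ArgumentException("QuadPoints must contain a multiple of 8 numbers.");

        var boxes = new List<Box>();
        for (int q = 0; q < quadPoints.Length; q += 8)
        {
            float minX = float.MaxValue, minY = float.MaxValue;
            float maxX = float.MinValue, maxY = float.MinValue;

            // Each quadrilateral is four (x, y) pairs; take min/max so the
            // result does not depend on the order of the corner points.
            for (int p = 0; p < 8; p += 2)
            {
                float x = quadPoints[q + p];
                float y = quadPoints[q + p + 1];
                minX = Math.Min(minX, x);
                maxX = Math.Max(maxX, x);
                minY = Math.Min(minY, y);
                maxY = Math.Max(maxY, y);
            }
            boxes.Add(new Box(minX, minY, maxX, maxY));
        }
        return boxes;
    }
}
```

Each resulting box could then be turned into an iTextSharp Rectangle and used as the region for text extraction, one region per highlighted line, instead of relying on the single Rect entry that covers the whole annotation.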
Here is a complete example of extracting highlighted text using iTextSharp:
// Requires: using System; using System.IO;
// using iTextSharp.text; using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
public void GetRectAnno()
{
    string appRootDir = new DirectoryInfo(Environment.CurrentDirectory).Parent.Parent.FullName;
    string filePath = Path.Combine(appRootDir, "PDFs", "anot.pdf");
    try
    {
        using (PdfReader reader = new PdfReader(filePath))
        {
            // Page numbers are 1-based in iTextSharp
            for (int i = 1; i <= reader.NumberOfPages; i++)
            {
                PdfDictionary page = reader.GetPageN(i);
                PdfArray annots = page.GetAsArray(PdfName.ANNOTS);
                if (annots == null)
                    continue; // this page has no annotations

                foreach (PdfObject annot in annots.ArrayList)
                {
                    // Resolve the (usually indirect) reference to the annotation dictionary
                    PdfDictionary annotationDic = PdfReader.GetPdfObject(annot) as PdfDictionary;
                    if (annotationDic == null)
                        continue;

                    // Only handle annotations whose Subtype is Highlight (null-safe comparison)
                    PdfName subType = annotationDic.GetAsName(PdfName.SUBTYPE);
                    if (!PdfName.HIGHLIGHT.Equals(subType))
                        continue;

                    // Rect is the bounding rectangle of the whole annotation
                    PdfArray coordinates = annotationDic.GetAsArray(PdfName.RECT);
                    if (coordinates == null || coordinates.Size < 4)
                        continue;

                    Console.Write("HighLight at Rectangle {0} with QuadPoints {1}\n", coordinates, annotationDic.GetAsArray(PdfName.QUADPOINTS));

                    // Read the four numbers directly instead of parsing their string form
                    Rectangle rect = new Rectangle(
                        coordinates.GetAsNumber(0).FloatValue,
                        coordinates.GetAsNumber(1).FloatValue,
                        coordinates.GetAsNumber(2).FloatValue,
                        coordinates.GetAsNumber(3).FloatValue);

                    // Extract only the page text that falls inside the rectangle
                    RenderFilter[] filter = { new RegionTextRenderFilter(rect) };
                    ITextExtractionStrategy strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
                    string text = PdfTextExtractor.GetTextFromPage(reader, i, strategy);

                    // Show the extracted text on the console
                    Console.WriteLine(text);
                }
            }
        }
    }
    catch (Exception ex)
    {
        // Report the failure instead of silently swallowing it
        Console.Error.WriteLine("Failed to extract highlighted text: " + ex.Message);
    }
}