
PDF Reading highlighted text (highlight annotations) using C#

I have written an extraction tool using iTextSharp that extracts annotation information from PDF documents. For highlight annotations, I only get a rectangle for the highlighted area on the page.

I am aiming to extract the text that has been highlighted. For that I use `PdfTextExtractor`:

Rectangle rect = new Rectangle(
    pdfArray.GetAsNumber(0).FloatValue, 
    pdfArray.GetAsNumber(1).FloatValue,
    pdfArray.GetAsNumber(2).FloatValue,
    pdfArray.GetAsNumber(3).FloatValue);

RenderFilter[] filter = { new RegionTextRenderFilter(rect) };
ITextExtractionStrategy strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
string textInsideRect = PdfTextExtractor.GetTextFromPage(pdfReader, pageNo, strategy);
return textInsideRect;

The result returned by PdfTextExtractor is not entirely correct. For instance, it returns "was going to eliminate the paper chase" even though only "eliminate" was highlighted.

Interestingly enough, the entire text of the TJ containing the highlighted "eliminate" is "was going to eliminate the paper chase" (TJ is the PDF operator that writes text to the page).

I would love to hear any input regarding this issue - also solutions that don't involve iTextSharp.

asked Dec 16 '22 by sdalby

2 Answers

The cause

Interestingly enough, the entire text of the TJ containing the highlighted "eliminate" is "was going to eliminate the paper chase" (TJ is the PDF operator that writes text to the page).

This actually is the reason for your issue. The iText parser classes forward the text to the render listeners in the pieces they find as continuous strings in the content stream. The filter mechanism you use filters these pieces. Thus, that whole sentence is accepted by the filter.

What you need, therefore, is some pre-processing step which splits these pieces into their individual characters and forwards these individually to your filtered render listener.

This actually is fairly easy to implement. The argument type in which the text pieces are forwarded, TextRenderInfo, offers a method to split itself up:

/**
 * Provides detail useful if a listener needs access to the position of each individual glyph in the text render operation
 * @return A list of {@link TextRenderInfo} objects that represent each glyph used in the draw operation. The net effect is as if there was a separate Tj operation for each character in the rendered string
 * @since 5.3.3
 */
public List<TextRenderInfo> getCharacterRenderInfos() // iText / Java
virtual public List<TextRenderInfo> GetCharacterRenderInfos() // iTextSharp / .Net

Thus, all you have to do is create and use a RenderListener / IRenderListener implementation which forwards all calls it receives to another listener (in your case, the filtered listener), with the twist that renderText / RenderText splits its TextRenderInfo argument and forwards the resulting pieces one by one.

A Java sample

As the OP asked for more details, here is some more code. As I predominantly work with Java, I'm providing it in Java for iText, but it is easy to port to C# for iTextSharp.

As mentioned above a pre-processing step is needed which splits the text pieces into their individual characters and forwards them individually to your filtered render listener.

For this step you can use this class TextRenderInfoSplitter:

package stackoverflow.itext.extraction;

import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextRenderInfo;

public class TextRenderInfoSplitter implements TextExtractionStrategy
{
    public TextRenderInfoSplitter(TextExtractionStrategy strategy)
    {
        this.strategy = strategy;
    }

    public void renderText(TextRenderInfo renderInfo)
    {
        for (TextRenderInfo info : renderInfo.getCharacterRenderInfos())
        {
            strategy.renderText(info);
        }
    }

    public void beginTextBlock()
    {
        strategy.beginTextBlock();
    }

    public void endTextBlock()
    {
        strategy.endTextBlock();
    }

    public void renderImage(ImageRenderInfo renderInfo)
    {
        strategy.renderImage(renderInfo);
    }

    public String getResultantText()
    {
        return strategy.getResultantText();
    }

    final TextExtractionStrategy strategy;
}

If you have a TextExtractionStrategy strategy (like your new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter)), you now can feed it with single-character TextRenderInfo instances like this:

String textInsideRect = PdfTextExtractor.getTextFromPage(reader, pageNo, new TextRenderInfoSplitter(strategy));

I tested it with the PDF created in this answer for the area

Rectangle rect = new Rectangle(200, 600, 200, 135);

For reference I marked the area in the PDF:

Screenshot of PDF with marked area

Text extraction filtered by area without the TextRenderInfoSplitter results in:

I am trying to create a PDF file with a lot
of text contents in the document. I am
using PDFBox

Text extraction filtered by area with the TextRenderInfoSplitter results in:

 to create a PDF f
ntents in the docu
n g P D F

BTW, here you see a disadvantage of splitting the text into individual characters early: the final text line is typeset with very large character spacing. If you keep the text segments from the PDF as they are, text extraction strategies can still easily see that the line consists of the two words "using" and "PDFBox". As soon as you feed the text segments character by character into the text extraction strategies, they are likely to interpret such widely spaced words as many one-letter words.
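To illustrate the effect (this is a simplified sketch, not iText's actual word-assembly algorithm): word grouping typically relies on the horizontal gap between consecutive glyphs, so once glyphs arrive one by one with large spacing, every gap looks like a word boundary.

```java
import java.util.ArrayList;
import java.util.List;

public class WordGroupingDemo {
    // Groups glyphs into words: a gap wider than the threshold starts a new word.
    static List<String> group(char[] glyphs, float[] xStart, float[] width, float gapThreshold) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < glyphs.length; i++) {
            if (i > 0 && xStart[i] - (xStart[i - 1] + width[i - 1]) > gapThreshold) {
                words.add(current.toString());
                current.setLength(0);
            }
            current.append(glyphs[i]);
        }
        if (current.length() > 0) words.add(current.toString());
        return words;
    }

    public static void main(String[] args) {
        char[] glyphs = {'u', 's', 'i', 'n', 'g'};
        float[] widths = {5, 5, 5, 5, 5};
        // Tightly set glyphs: recognized as one word.
        float[] tight = {0, 5, 10, 15, 20};
        System.out.println(group(glyphs, tight, widths, 2)); // [using]
        // Widely set glyphs: each glyph looks like its own word.
        float[] wide = {0, 10, 20, 30, 40};
        System.out.println(group(glyphs, wide, widths, 2)); // [u, s, i, n, g]
    }
}
```

With tightly set glyphs the whole line survives as one word; with wide character spacing the same glyphs fall apart into one-letter "words".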

An improvement

For instance, the highlighted word "eliminate" is extracted as "o eliminate t". The word was highlighted by double-clicking it in Adobe Acrobat Reader.

Something similar happens in my sample above, letters barely touching the area of interest make it into the result.

This is due to the RegionTextRenderFilter implementation of allowText allowing all text whose baseline intersects the rectangle in question, even if the intersection is merely a single point:

public boolean allowText(TextRenderInfo renderInfo){
    LineSegment segment = renderInfo.getBaseline();
    Vector startPoint = segment.getStartPoint();
    Vector endPoint = segment.getEndPoint();

    float x1 = startPoint.get(Vector.I1);
    float y1 = startPoint.get(Vector.I2);
    float x2 = endPoint.get(Vector.I1);
    float y2 = endPoint.get(Vector.I2);

    return filterRect.intersectsLine(x1, y1, x2, y2);
}

Given that you first split the text into characters, you might want to check whether their respective baselines are completely contained in the area in question, i.e. implement your own RenderFilter by copying RegionTextRenderFilter and then replacing the line

return filterRect.intersectsLine(x1, y1, x2, y2);

by

return filterRect.contains(x1, y1) && filterRect.contains(x2, y2);

Depending on how exactly text is highlighted in Adobe Acrobat Reader, though, you might want to adjust this check in a completely custom way.
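The difference between the two checks can be seen without iText at all, since RegionTextRenderFilter delegates to java.awt.geom anyway. A minimal self-contained sketch (class name and sample coordinates are made up for illustration):

```java
import java.awt.geom.Rectangle2D;

public class BaselineFilterDemo {
    // Lenient check, as in RegionTextRenderFilter: any intersection passes.
    static boolean intersects(Rectangle2D rect, float x1, float y1, float x2, float y2) {
        return rect.intersectsLine(x1, y1, x2, y2);
    }

    // Stricter check: both baseline endpoints must lie inside the rectangle.
    static boolean contained(Rectangle2D rect, float x1, float y1, float x2, float y2) {
        return rect.contains(x1, y1) && rect.contains(x2, y2);
    }

    public static void main(String[] args) {
        // Area of interest: x in [100, 150), y in [100, 120)
        Rectangle2D rect = new Rectangle2D.Float(100, 100, 50, 20);
        // A glyph baseline that only clips the left edge of the area:
        System.out.println(intersects(rect, 90, 110, 105, 110)); // true
        System.out.println(contained(rect, 90, 110, 105, 110));  // false
    }
}
```

A baseline that barely touches the region passes the lenient check but fails the strict one, which is exactly why letters at the region's edge leak into the extracted text.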

answered Dec 28 '22 by mkl

Highlight annotations are represented as a collection of quadrilaterals describing the area(s) on the page covered by the annotation, stored in the /QuadPoints entry of the annotation dictionary.
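For reference, a /QuadPoints array is a flat list of numbers, eight per quadrilateral (four x, y corner pairs). A small sketch of turning that flat array into per-quad bounding boxes (the class and method names are made up; the bounding-box approach deliberately ignores corner ordering, which varies between PDF producers):

```java
import java.util.ArrayList;
import java.util.List;

public class QuadPointsDemo {
    /**
     * Splits a flat /QuadPoints array into per-quadrilateral bounding boxes.
     * Each quad is 8 numbers: four (x, y) corner pairs.
     * Returns rectangles as float[]{llx, lly, urx, ury}.
     */
    static List<float[]> quadBounds(float[] quadPoints) {
        List<float[]> result = new ArrayList<>();
        for (int i = 0; i + 8 <= quadPoints.length; i += 8) {
            float llx = Float.MAX_VALUE, lly = Float.MAX_VALUE;
            float urx = -Float.MAX_VALUE, ury = -Float.MAX_VALUE;
            for (int j = 0; j < 8; j += 2) {
                llx = Math.min(llx, quadPoints[i + j]);
                urx = Math.max(urx, quadPoints[i + j]);
                lly = Math.min(lly, quadPoints[i + j + 1]);
                ury = Math.max(ury, quadPoints[i + j + 1]);
            }
            result.add(new float[]{llx, lly, urx, ury});
        }
        return result;
    }
}
```

A highlight spanning two lines yields two quads, hence two rectangles; filtering text extraction per rectangle (rather than per the annotation's single /Rect) avoids picking up text between the lines.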

Why are they this way?

This is my fault, actually. In Acrobat 1.0, I worked on the "find text" code which initially only used a rectangle for the representation of a selected area on the page. While working on the code, I was very unhappy with the results, especially with maps where the text followed land details.

As a result, I made the find tool build up a set of quadrilaterals on the page and anneal them, when possible, to build words.

In Acrobat 2.0, the engineer responsible for full generalized text extraction built an algorithm called Wordy that was better than my first cut, but he kept the quadrilateral code since that was the most accurate representation of what was on the page.

Almost all text-related code was refactored to use this code.

Then we get highlight annotations. When markup annotations were added to Acrobat, they were used to decorate text that was already on the page. When a user clicks down on a page, Wordy extracts the text into appropriate data structures and then the text select tool maps mouse motion onto the quadrilateral sets. When a text highlight annotation is created, the subset of quadrilaterals from Wordy get placed into a new text highlight annotation.

How do you get the words on the page that are highlighted? Tricky. You have to extract the text on the page (you don't have Wordy, sorry) and then find all quads that are contained within the set from the annotation.
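One simple way to implement that last matching step, sketched here with made-up names and rectangles as float[]{llx, lly, urx, ury}: take the bounding box of each extracted word and keep the words whose center falls inside one of the annotation's quads. A center test tolerates the small overlaps at quad edges discussed in the other answer; stricter containment checks are equally possible.

```java
import java.util.ArrayList;
import java.util.List;

public class HighlightMatcher {
    // True if the center of the word box lies inside the quad's bounding box.
    static boolean containsCenter(float[] quad, float[] word) {
        float cx = (word[0] + word[2]) / 2;
        float cy = (word[1] + word[3]) / 2;
        return cx >= quad[0] && cx <= quad[2] && cy >= quad[1] && cy <= quad[3];
    }

    // Returns the indices of the word boxes covered by any annotation quad.
    static List<Integer> highlightedWords(List<float[]> annotQuads, List<float[]> wordBoxes) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < wordBoxes.size(); i++) {
            for (float[] quad : annotQuads) {
                if (containsCenter(quad, wordBoxes.get(i))) {
                    hits.add(i);
                    break;
                }
            }
        }
        return hits;
    }
}
```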

answered Dec 28 '22 by plinth