Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to get the text position from the pdf page in iText 7

Tags:

itext7

I am trying to find the text position in PDF page?

What I have tried is to get the text in the PDF page by PDF Text Extractor using simple text extraction strategy. I am looping each word to check if my word exists. split the words using:

var Words = pdftextextractor.Split(new char[] { ' ', '\n' });

What I wasn't able to do is to find the text position. The problem is I wasn't able to find the location of the text. All I need to find is the y co-ordinates of the word in the PDF file.

like image 477
Speed Avatar asked Dec 02 '22 11:12

Speed


1 Answers

I was able to manipulate it with my previous version for Itext5. I don't know if you are looking for C# but that is what the below code is written in.

using iText.Kernel.Geom;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Data;
using iText.Kernel.Pdf.Canvas.Parser.Listener;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

class TextLocationStrategy : LocationTextExtractionStrategy
{
    private List<textChunk> objectResult = new List<textChunk>();

    public override void EventOccurred(IEventData data, EventType type)
    {
        if (!type.Equals(EventType.RENDER_TEXT))
            return;

        TextRenderInfo renderInfo = (TextRenderInfo)data;

        string curFont = renderInfo.GetFont().GetFontProgram().ToString();

        float curFontSize = renderInfo.GetFontSize();

        IList<TextRenderInfo> text = renderInfo.GetCharacterRenderInfos();
        foreach (TextRenderInfo t in text)
        {
            string letter = t.GetText();
            Vector letterStart = t.GetBaseline().GetStartPoint();
            Vector letterEnd = t.GetAscentLine().GetEndPoint();
            Rectangle letterRect = new Rectangle(letterStart.Get(0), letterStart.Get(1), letterEnd.Get(0) - letterStart.Get(0), letterEnd.Get(1) - letterStart.Get(1));

            if (letter != " " && !letter.Contains(' '))
            {
                textChunk chunk = new textChunk();
                chunk.text = letter;
                chunk.rect = letterRect;
                chunk.fontFamily = curFont;
                chunk.fontSize = curFontSize;
                chunk.spaceWidth = t.GetSingleSpaceWidth() / 2f;

                objectResult.Add(chunk);
            }
        }
    }
}
public class textChunk
{
    public string text { get; set; }
    public Rectangle rect { get; set; }
    public string fontFamily { get; set; }
    public int fontSize { get; set; }
    public float spaceWidth { get; set; }
}

I also get down to each individual character because it works better for my process. You can manipulate the names, and of course the objects, but I created the textchunk to hold what I wanted, rather than have a bunch of renderInfo objects.

You can implement this by adding a few lines to grab the data from your pdf.

PdfDocument reader = new PdfDocument(new PdfReader(filepath));
FilteredEventListener listener = new FilteredEventListener();
var strat = listener.AttachEventListener(new TextExtractionStrat());
PdfCanvasProcessor processor = new PdfCanvasProcessor(listener);
processor.ProcessPageContent(reader.GetPage(1));

Once you are this far, you can pull the objectResult from the strat by making it public or creating a method within your class to grab the objectResult and do something with it.

like image 160
Histerical Avatar answered Dec 29 '22 10:12

Histerical