
Extract PDF text by coordinates

Tags:

c#

pdf

.net-4.0

I'd like to know whether there is a PDF library for Microsoft .NET that can extract text from a region given by coordinates.

For example (in pseudo-code):

PdfReader reader = new PdfReader();
reader.Load("file.pdf");

// Top, bottom, left, right in pixels or any other unit
string wholeText = reader.GetText(100, 150, 20, 50);

I've tried to do this using PDFBox for .NET (the one that runs on top of IKVM) with no luck, and it seems to be very outdated and undocumented.

Perhaps someone has a good sample of doing this with PDFBox, iTextSharp or any other open-source library and can give me a hint.

Thank you in advance.

Matías Fidemraizer asked Sep 13 '11 16:09


2 Answers

Well, thank you for your effort, everyone.

I got it working using Apache PDFBox compiled for .NET with IKVM, and this is the final code:

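// Load the PDF (getAllPages() below is the PDFBox 1.x API).
PDDocument doc = PDDocument.load(@"c:\invoice.pdf");

PDFTextStripperByArea stripper = new PDFTextStripperByArea();
// Rectangle arguments are x, y, width, height.
stripper.addRegion("testRegion", new java.awt.Rectangle(0, 10, 100, 100));
// Extract the text of every registered region from the first page.
stripper.extractRegions((PDPage)doc.getDocumentCatalog().getAllPages().get(0));

string text = stripper.getTextForRegion("testRegion");

// Release the document once the text has been read.
doc.close();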

And it works like a charm.
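
For anyone starting from the question's GetText(top, bottom, left, right) pseudo-code: java.awt.Rectangle takes (x, y, width, height), so the edges need converting first. Here is a minimal sketch of that conversion, assuming the y axis grows downwards and all edges use the same units. RegionBox and FromEdges are hypothetical names of my own, not part of PDFBox.

```csharp
using System;

// Hypothetical helper (not part of PDFBox): converts edge coordinates
// into the x, y, width, height form that java.awt.Rectangle expects.
public struct RegionBox
{
    public readonly int X, Y, Width, Height;

    public RegionBox(int x, int y, int width, int height)
    {
        X = x; Y = y; Width = width; Height = height;
    }

    // Assumes top <= bottom and left <= right, with y growing downwards.
    public static RegionBox FromEdges(int top, int bottom, int left, int right)
    {
        if (bottom < top || right < left)
            throw new ArgumentException("Edges must satisfy top <= bottom and left <= right.");
        return new RegionBox(left, top, right - left, bottom - top);
    }
}
```

With the question's values, RegionBox.FromEdges(100, 150, 20, 50) gives x = 20, y = 100, width = 30, height = 50, which could then be passed on as new java.awt.Rectangle(box.X, box.Y, box.Width, box.Height).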

Thank you anyway; I hope my own answer will help others. If you need further details, just leave a comment here and I'll update this answer.

Matías Fidemraizer answered Sep 21 '22 22:09


It's not open source, but hopefully this helps you (and potentially anyone else using ABCpdf).

I did this earlier today by looping over the available fields in the PDF. This means the PDF needs to have been created with proper form fields, and you need to know the name of the field whose text you want (you can work this out by setting a breakpoint and looping through the available fields).

WebSupergoo.ABCpdf6.Doc newPDF = new WebSupergoo.ABCpdf6.Doc();
newPDF.Read("existing_file.pdf");

foreach ( WebSupergoo.ABCpdf6.Objects.Field field in newPDF.Form.Fields )
{
    if ( field.Name == "Text1" )
    {
        // update "Text1"
        field.Value = "new value for Text1";
    }
}

newPDF.Save("new_file.pdf");

newPDF.Clear();

In the example, "Text1" is the name of the field that is being updated. Note that the example also shows how to save the updated field(s) to a new file.

Hopefully that at least gives you an idea of how to approach this problem.

Ben Pearson answered Sep 20 '22 22:09