Why is GetTextFromPage from iTextSharp returning longer and longer strings?

I am using the latest iTextSharp library from NuGet (5.5.8) to parse some text from a PDF file. The problem I am facing is that the GetTextFromPage method does not return only the text from the requested page; it also returns the text from the previous pages. Here is my code:

var url = "https://www.oslo.kommune.no/getfile.php/Innhold/Politikk%20og%20administrasjon/Etater%20og%20foretak/Utdanningsetaten/Postjournal%20Utdanningsetaten/UDE03032016.pdf";
var strategy = new SimpleTextExtractionStrategy();
using (var reader = new PdfReader(new Uri(url)))
{
    for (var page = 1; page <= reader.NumberOfPages; page++)
    {
        var textFromPage = PdfTextExtractor.GetTextFromPage(reader, page, strategy);
        Console.WriteLine(textFromPage.Length);
    }
}

The output looks like this, which is not what I need. I need only the text that is actually on each page:

1106
2248
3468
4835
5167
6431
7563
8860
9962
11216
12399
13640
14690
15760

Any ideas?

Espo asked Jan 02 '26 03:01


1 Answer

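You feed all pages into the same text extraction strategy. A SimpleTextExtractionStrategy instance keeps appending every text chunk it receives to its internal buffer, so each call returns the text of all pages processed so far: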

var strategy = new SimpleTextExtractionStrategy();
using (var reader = new PdfReader(new Uri(url)))
{
    for (var page = 1; page <= reader.NumberOfPages; page++)
    {
        var textFromPage = PdfTextExtractor.GetTextFromPage(reader, page, strategy);
        // ... process textFromPage ...
    }
}

As you want to process the content of each page by itself, you should instead create a new strategy for each page:

using (var reader = new PdfReader(new Uri(url)))
{
    for (var page = 1; page <= reader.NumberOfPages; page++)
    {
        var strategy = new SimpleTextExtractionStrategy();
        var textFromPage = PdfTextExtractor.GetTextFromPage(reader, page, strategy);
        // ... process textFromPage ...
    }
}
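
To illustrate why the shared instance produces growing strings, here is a minimal sketch that does not use iTextSharp at all. `AccumulatingStrategy` is a hypothetical stand-in invented for this example; it only mimics a strategy that appends to an internal buffer and returns the whole buffer, which is an assumption about the behavior observed in the question, not a copy of the library's implementation.

```csharp
using System;
using System.Text;

// Hypothetical stand-in for a stateful extraction strategy; NOT an iTextSharp type.
class AccumulatingStrategy
{
    private readonly StringBuilder buffer = new StringBuilder();

    // Appends the new text and returns everything collected so far.
    public string Feed(string pageText)
    {
        buffer.Append(pageText);
        return buffer.ToString();
    }
}

static class Demo
{
    static void Main()
    {
        // Made-up page contents with lengths 10, 6 and 15.
        string[] pages = { "first page", "second", "third page text" };

        // Shared instance: each result includes all earlier pages.
        var shared = new AccumulatingStrategy();
        foreach (var page in pages)
            Console.WriteLine("shared: " + shared.Feed(page).Length);

        // Fresh instance per page: each result covers that page only.
        foreach (var page in pages)
            Console.WriteLine("fresh:  " + new AccumulatingStrategy().Feed(page).Length);
    }
}
```

With the shared instance the printed lengths are 10, 16, 31 (cumulative, like the output in the question); with a fresh instance per page they are 10, 6, 15.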
mkl answered Jan 04 '26 17:01


