iText GetTextFromPage returns the text from the beginning for every page

Tags: itext

I have this simple piece of code. The problem is very strange: on every iteration, the reader returns all of the text from the beginning of the PDF document. This is probably something simple, but I can't see it.

...
PdfReader reader = new PdfReader ( path );
PdfReaderContentParser parser = new PdfReaderContentParser ( reader );
...
public void Read(int start, int end)
{
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();

    StringBuilder sb = new StringBuilder();

    for (int page = start; page < end; page++)
    {
        try
        {
            sb.Append(PdfTextExtractor.GetTextFromPage(reader, page, strategy));
        }
        catch (Exception ex)
        {
            throw new PdfException(ex.Message, ex.InnerException);
        }

        var p = new Page { Number = page, Content = sb.ToString()};
        sb.Clear();
        PageParsed?.Invoke(this, new PdfEventArgs<Page>(p));
    }
    FileParsed?.Invoke(this, new PdfEventArgs<string>(string.IsNullOrEmpty(Name) ? "File parsed" : Name));
}
Matt asked Oct 18 '25 11:10


1 Answer

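The strategy object keeps state: SimpleTextExtractionStrategy appends each page's text to an internal buffer, so reusing one instance makes every call return everything extracted so far. Move the object instantiation inside your loop so that each page gets a fresh strategy, like this: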

StringBuilder sb = new StringBuilder();

for (int page = start; page < end; page++)
{
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    try
    {
        sb.Append(PdfTextExtractor.GetTextFromPage(reader, page, strategy));
    }
    catch (Exception ex)
    {
        throw new PdfException(ex.Message, ex.InnerException);
    }

    var p = new Page { Number = page, Content = sb.ToString()};
    sb.Clear();
    PageParsed?.Invoke(this, new PdfEventArgs<Page>(p));
}

This will solve your problem.
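
To see why the shared instance misbehaves, here is a minimal sketch that does not use iText at all. AccumulatingStrategy and ExtractPage are hypothetical stand-ins, written on the assumption that the real strategy simply appends to an internal buffer that is never cleared; they are not the library's actual classes.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for a stateful extraction strategy: it appends
// every chunk it receives to an internal buffer and never clears it.
public sealed class AccumulatingStrategy
{
    private readonly StringBuilder _result = new StringBuilder();

    public void RenderText(string chunk) => _result.Append(chunk);

    public string GetResultantText() => _result.ToString();
}

public static class StrategyDemo
{
    // Hypothetical stand-in for GetTextFromPage: feeds one page's text
    // into the strategy and returns whatever the strategy has collected.
    public static string ExtractPage(string pageText, AccumulatingStrategy strategy)
    {
        strategy.RenderText(pageText);
        return strategy.GetResultantText();
    }

    // One strategy for all pages: each result also contains earlier pages.
    public static List<string> WithSharedStrategy(IEnumerable<string> pages)
    {
        var strategy = new AccumulatingStrategy();
        var results = new List<string>();
        foreach (var page in pages)
        {
            results.Add(ExtractPage(page, strategy));
        }
        return results;
    }

    // A new strategy per page: each result contains only that page.
    public static List<string> WithFreshStrategy(IEnumerable<string> pages)
    {
        var results = new List<string>();
        foreach (var page in pages)
        {
            results.Add(ExtractPage(page, new AccumulatingStrategy()));
        }
        return results;
    }
}
```

For pages "A", "B", "C", WithSharedStrategy returns "A", "AB", "ABC", which is the behavior described in the question, while WithFreshStrategy returns "A", "B", "C".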

Bruno Lowagie answered Oct 21 '25 23:10