How to read data from table-structured PDF using itextsharp?

Tags: c#, itextsharp

I am having a problem reading some data from a PDF file.
My file is structured and contains tables and plain text. The standard parser reads data from separate columns as if they were on the same line. For example:

Some Table Header  
Data Col1a     Data Col2a      Data Col3a
Data Col1b     Data Col2b      Data Col3b
               Data Col2c

with this code

        PdfReader reader = new PdfReader(pdfName);

        List<string> text = new List<string>();
        string page;
        List<string> pageStrings;
        // Split on both Unix and Windows line breaks.
        string[] separators = { "\n", "\r\n" };

        // iText page numbers are 1-based.
        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            // No strategy is passed, so the default extraction strategy is used.
            page = PdfTextExtractor.GetTextFromPage(reader, i);
            pageStrings = new List<string>(page.Split(separators, StringSplitOptions.RemoveEmptyEntries));
            text.AddRange(pageStrings);
        }

        reader.Close();

        return text;

is extracted as the following strings:

Some Table Header
Data Col1a Data Col2a Data Col3a  
Data Col1b Data Col2b Data Col3b  
Data Col2c  

Instead, I'd like to get concatenated strings that reflect the data blocks. For the example above, I'd like to get these strings:

Some Table Header
Data Col1a Data Col1b   
Data Col2a Data Col2b Data Col2c  
Data Col3a Data Col3b

Does anyone know how to tune iTextSharp so that its PDF parser behaves this way? Or does anyone have a suitable code sample?
The sample PDF file is here

Vadym Romanenko asked Aug 14 '15 16:08


1 Answer

The OP's sample file contains multiple sections like this one:

[Image: sampleFile.pdf, top of page 1]

And the OP mentioned in a comment:

another one tool parse my PDF exactly like I want. [...]

PS: this tool is pdfbox

Using PDFBox (v1.8.10, the current release version) in this method:

String extract(PDDocument document) throws IOException
{
    PDFTextStripper stripper = new PDFTextStripper();
    return stripper.getText(document);
}

returns the following for the section shown above:

Driver Book for 8/5/2015
Company IS MEDICAL; AND Date of Service IS BETWEEN 08/05/2015 AND 08/05/2015; AND Status IS Assigned; AND Vehicles IS  MEDICAL: 
CATY
 MEDICAL
Trip #: 314-A
Comments: ----LIVERY---
Destination:Pick-up:
Call Type: Livery
<Doctor Office>
REGO PARK,  (631) 
000-0000
(718) 896-5953
74- AVE 204E  HEIGHTS, NY 
11372 (718) 639-4154
11:00:00 PAT, MIKHAIL
Trip #: 314-B
Comments:  ----LIVERY---
Destination:Pick-up:
Call Type: Livery
74- AVE 204E  HEIGHTS, NY 
11372 (718) 639-4154
<Doctor Office>
63-6 REGO PARK, NY 
11374 (631) 000-0000
11:01:00 PAT, MIKHAIL

This is not really a neat column-wise extraction, but certain blocks of information (like address blocks) remain together.

Getting the same output with iText(Sharp) is actually very easy: one merely has to explicitly use the SimpleTextExtractionStrategy instead of the LocationTextExtractionStrategy, which is used by default. That is, one has to replace this line

page = PdfTextExtractor.GetTextFromPage(reader, i);

by

page = PdfTextExtractor.GetTextFromPage(reader, i, new SimpleTextExtractionStrategy());

With the exception of one space character per dataset (iText(Sharp) extracts "Destination: Pick-up:" instead of "Destination:Pick-up:"), the results are identical.


Concerning your conclusion from PDFBox extracting the text as it does:

So I think that PDF is really table structured.

Actually this order of extraction means merely that the operations for drawing the string segments in the PDF page content stream occur in this very order. As the order of those operations is arbitrary according to the PDF specification, any update of the software generating those PDFs may result in files from which the PDFBox PDFTextStripper and the iText SimpleTextExtractionStrategy extract merely an unintelligible soup of characters.
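To make the difference between the two orders concrete, here is a toy model in plain Java. It is not iText or PDFBox code: the Chunk class, its coordinates, and both method names are made up for illustration, and real position-based strategies also have to deal with line detection, rotation, and spacing tolerances. The sketch assumes a generator that draws a small table column by column.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Toy model (not library code): drawing order versus position order. */
final class ExtractionOrderDemo {

    /** A drawn string together with the hypothetical coordinates of its start point. */
    static final class Chunk {
        final String text;
        final float x;
        final float y;

        Chunk(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Joins the chunks in the order in which they were drawn (content stream order). */
    static String inDrawingOrder(List<Chunk> chunks) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : chunks) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(c.text);
        }
        return sb.toString();
    }

    /** Sorts the chunks top to bottom, then left to right, before joining them. */
    static String inPositionOrder(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        // PDF y coordinates grow upwards, so a larger y is higher on the page.
        sorted.sort(Comparator.comparingDouble((Chunk c) -> c.y)
                .reversed()
                .thenComparingDouble(c -> c.x));
        return inDrawingOrder(sorted);
    }
}
```

For chunks drawn column by column, inDrawingOrder keeps each column together, while inPositionOrder interleaves the columns line by line. If the generating software changed its drawing order, only the first result would change.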


PS: If one sets the PDFBox PDFTextStripper property SortByPosition to true like this

    PDFTextStripper stripper = new PDFTextStripper();
    stripper.setSortByPosition(true);
    return stripper.getText(document);

then PDFBox extracts the text just like iText(Sharp) does with the (default) LocationTextExtractionStrategy.


The OP indicated interest in a block structure inherent in the content stream. The most obvious structure like that in a generic PDF would be the text objects (in which multiple strings may be drawn).

In the case at hand the SimpleTextExtractionStrategy is used. It can easily be extended to also include markers corresponding to the start and end of text objects in its output. In Java this can be done by using an anonymous class like this:

return PdfTextExtractor.getTextFromPage(reader, pageNo, new SimpleTextExtractionStrategy()
{
    boolean empty = true;

    @Override
    public void beginTextBlock()
    {
        if (!empty)
            appendTextChunk("<BLOCK>");
        super.beginTextBlock();
    }

    @Override
    public void endTextBlock()
    {
        if (!empty)
            appendTextChunk("</BLOCK>\n");
        super.endTextBlock();
    }

    @Override
    public String getResultantText()
    {
        if (empty)
            return super.getResultantText();
        else
            return "<BLOCK>" + super.getResultantText();
    }

    @Override
    public void renderText(TextRenderInfo renderInfo)
    {
        empty = false;
        super.renderText(renderInfo);
    }
});

(TextExtraction.java method extractSimple)

(This Java code should be easy to translate into C#. Juggling the empty flag may look odd, but it is necessary because the base class assumes certain additional properties to be set as soon as some chunk has been appended to the extracted content.)

Using this extended strategy one gets for the section shown above:

<BLOCK>Driver Book for 8/5/2015
Company IS MEDICAL; AND Date of Service IS BETWEEN 08/05/2015 AND 08/05/2015; AND Status IS Assigned; AND Vehicles IS  MEDICAL: 
CATY</BLOCK>
<BLOCK>
 MEDICAL</BLOCK>
<BLOCK>
Trip #: 314-A</BLOCK>
<BLOCK>
Comments: ----LIVERY---</BLOCK>
<BLOCK>
Destination: Pick-up:</BLOCK>
<BLOCK>
Call Type: Livery
<Doctor Office>
REGO PARK,  (631) 
000-0000
(718) 896-5953</BLOCK>
<BLOCK>
74- AVE 204E  HEIGHTS, NY 
11372 (718) 639-4154</BLOCK>
<BLOCK>
11:00:00</BLOCK>
<BLOCK> PAT, MIKHAIL</BLOCK>
<BLOCK>
Trip #: 314-B</BLOCK>
<BLOCK>
Comments:  ----LIVERY---</BLOCK>
<BLOCK>
Destination: Pick-up:</BLOCK>
<BLOCK>
Call Type: Livery
74- AVE 204E  HEIGHTS, NY 
11372 (718) 639-4154</BLOCK>
<BLOCK>
<Doctor Office>
63-6 REGO PARK, NY 
11374 (631) 000-0000</BLOCK>
<BLOCK>
11:01:00</BLOCK>
<BLOCK> PAT, MIKHAIL</BLOCK>

As this keeps addresses in the same block, this might help during extraction.
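A possible next step is to split the marked-up text back into individual blocks. The following is a minimal sketch in plain Java. The class name BlockSplitter and its split method are hypothetical helpers, not part of iText, and the sketch assumes that the extracted text itself never contains the literal markers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: splits text marked up with <BLOCK>...</BLOCK> into its blocks. */
final class BlockSplitter {

    // DOTALL lets '.' match line breaks, because a block may span several lines.
    // The reluctant quantifier stops at the first closing marker.
    private static final Pattern BLOCK =
            Pattern.compile("<BLOCK>(.*?)</BLOCK>", Pattern.DOTALL);

    /** Returns the trimmed content of every non-empty block, in document order. */
    static List<String> split(String marked) {
        List<String> blocks = new ArrayList<>();
        Matcher m = BLOCK.matcher(marked);
        while (m.find()) {
            String body = m.group(1).trim();
            if (!body.isEmpty()) {
                blocks.add(body);
            }
        }
        return blocks;
    }
}
```

Each returned string then holds one text object's content, so an address that was drawn inside a single text object arrives as one multi-line entry. Text such as <Doctor Office> passes through unchanged because only the exact markers are matched.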

mkl answered Nov 09 '22 15:11