Extract text from PDF between two dividers with ITextSharp

Tags: c#, pdf, itextsharp

I have a PDF with 1500+ pages containing some 'random' text, and I have to extract certain blocks of text from it. I can identify such a block like this:

bla bla bla bla bla 
...
...
...
-------------------------- (separator blue image)
XXX: TEXT TEXT TEXT
TEXT TEXT TEXT TEXT
...
-------------------------- (separator blue image)
bla bla bla bla
...
...
-------------------------- (separator blue image)
XXX: TEXT2 TEXT2 TEXT2
TEXT2 TEXT2 TEXT TEXT2
...
-------------------------- (separator blue image)

I need to extract all the text between separators (all blocks). The 'XXX' is present at the beginning of every block, but I don't have any way to detect the end of a block. Is it possible to use the image separator in the parser? How?

Is there any other possible way?

Edit: More information: there is no background, and the text can be copied and pasted.

Sample pdf : 1

Look, for example, at page 320.

Thanks

asked Jul 30 '15 17:07 by Paul


1 Answer

The theory

In case of your sample PDF the dividers are created using vector graphics:

0.58 0.17 0 0.47 K
q 1 0 0 1 56.6929 772.726 cm
0 0 m
249.118 0 l
S
Q
q 1 0 0 1 56.6929 690.9113 cm
0 0 m
249.118 0 l
S 

etc.
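To see where such a divider ends up on the page, one can apply the cm matrix to the path's endpoints by hand. The following is a minimal sketch in plain Java without iText; the class and method names are made up for illustration, and it assumes that no other transformation is in effect when the q ... cm instruction is reached.

```java
// Illustrative only: applies the operands [a b c d e f] of a PDF "cm"
// instruction to a point. The names are hypothetical and not part of iText.
class DividerCoordinates
{
    /** Returns { a*x + c*y + e, b*x + d*y + f } for the matrix m = [a b c d e f]. */
    static double[] transform(double[] m, double x, double y)
    {
        return new double[] { m[0] * x + m[2] * y + m[4], m[1] * x + m[3] * y + m[5] };
    }

    public static void main(String[] args)
    {
        // q 1 0 0 1 56.6929 772.726 cm
        double[] cm = { 1, 0, 0, 1, 56.6929, 772.726 };
        double[] from = transform(cm, 0, 0);       // 0 0 m
        double[] to = transform(cm, 249.118, 0);   // 249.118 0 l
        System.out.println("from x=" + from[0] + " y=" + from[1]);
        System.out.println("to   x=" + to[0] + " y=" + to[1]);
    }
}
```

Under that assumption the first divider is a horizontal line at y = 772.726 running from x = 56.6929 to about x = 305.81. In the real strategy this calculation is not done by hand; the current transformation matrix comes from the parser.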

Parsing vector graphics is a fairly new addition to iText(Sharp), and in that respect the API is up for some changes. Currently (version 5.5.6) you can parse vector graphics using an implementation of the interface ExtRenderListener (Java) / IExtRenderListener (.Net).

You now have some approaches to your task:

  • (multi-pass) You can implement the above-mentioned interface in a way that merely collects the lines. From these lines you derive rectangles encompassing each section, and for each of these rectangles you can extract the text applying region text filtering.
  • (two-pass) Just like above you can implement the above-mentioned interface in a way that merely collects the lines and from these lines you derive rectangles encompassing each section. Then you parse the page using the LocationTextExtractionStrategy and request the text of each rectangle using an appropriate ITextChunkFilter using the GetResultantText(ITextChunkFilter) overload.
  • (one pass) You can implement the above-mentioned interface in a way that collects the lines, collects text pieces, derives rectangles from the lines and arranges the text pieces located in those rectangles.
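What all three approaches have in common is the step from collected divider lines to sections and the assignment of text to them. As a rough sketch of that step for a single column, in plain Java without iText (the class, the methods and the sample coordinates are assumptions for illustration only), consider:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: one column, dividers given by their y coordinates.
// None of these names are iText API.
class SectionSketch
{
    /**
     * Returns the index of the section containing y. dividerYs must be sorted
     * top-down, i.e. descending, because PDF y coordinates grow upwards.
     */
    static int sectionOf(double y, double[] dividerYs)
    {
        int index = 0;
        for (double dividerY : dividerYs)
        {
            if (y <= dividerY)
                index++;    // the text is below this divider
            else
                break;
        }
        return index;
    }

    /** Groups text pieces by section; n dividers yield n + 1 sections. */
    static List<String> group(double[] ys, String[] texts, double[] dividerYs)
    {
        List<StringBuilder> builders = new ArrayList<StringBuilder>();
        for (int i = 0; i <= dividerYs.length; i++)
            builders.add(new StringBuilder());
        for (int i = 0; i < ys.length; i++)
        {
            StringBuilder builder = builders.get(sectionOf(ys[i], dividerYs));
            if (builder.length() > 0)
                builder.append('\n');
            builder.append(texts[i]);
        }
        List<String> result = new ArrayList<String>();
        for (StringBuilder builder : builders)
            result.add(builder.toString());
        return result;
    }

    public static void main(String[] args)
    {
        double[] dividers = { 772.7, 690.9 };
        double[] ys = { 790, 760, 750, 680 };
        String[] texts = { "header", "ADV: A", "Processo 1", "ADV: B" };
        for (String section : group(ys, texts, dividers))
            System.out.println("--\n" + section);
    }
}
```

The approaches then differ only in when the text is collected: in separate passes per rectangle, in one extra pass with a filter per rectangle, or in the same pass as the lines.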

A sample implementation

(As I'm more fluent in Java than in C#, I implemented this sample in Java for iText. It should be easy to port to C# and iTextSharp.)

This implementation attempts to extract text sections separated by dividers as in the sample PDF.

It is a one-pass solution which at the same time re-uses the existing LocationTextExtractionStrategy capabilities by deriving from that strategy.

In the same pass this strategy collects the text chunks (thanks to its parent class) and the divider lines (due to its implementation of the ExtRenderListener extra methods).

Having parsed a page, the strategy offers a list of Section instances via the method getSections(), each representing a section of the page delimited by a divider line above and/or below. The topmost and bottommost sections of each text column are open at the top or the bottom, implicitly delimited by the matching margin line.

Section implements the TextChunkFilter interface and, therefore, can be used to retrieve the text in the respective part of the page using the method getResultantText(TextChunkFilter) of the parent class.

This is merely a proof of concept (POC). It is designed to extract sections from documents using dividers exactly like the sample document does, i.e. horizontal lines drawn using moveTo-lineTo-stroke, as wide as the section is, and appearing in the content stream sorted column-wise. There may be still more implicit assumptions which hold for the sample PDF.

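import java.util.ArrayList;
import java.util.List;

import com.itextpdf.text.pdf.parser.ExtRenderListener;
import com.itextpdf.text.pdf.parser.LineSegment;
import com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.Path;
import com.itextpdf.text.pdf.parser.PathConstructionRenderInfo;
import com.itextpdf.text.pdf.parser.PathPaintingRenderInfo;
import com.itextpdf.text.pdf.parser.Vector;

public class DividerAwareTextExtrationStrategy extends LocationTextExtractionStrategy implements ExtRenderListener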
{
    //
    // constructor
    //
    /**
     * The constructor accepts top and bottom margin lines in user space y coordinates
     * and left and right margin lines in user space x coordinates.
     * Text outside those margin lines is ignored. 
     */
    public DividerAwareTextExtrationStrategy(float topMargin, float bottomMargin, float leftMargin, float rightMargin)
    {
        this.topMargin = topMargin;
        this.bottomMargin = bottomMargin;
        this.leftMargin = leftMargin;
        this.rightMargin = rightMargin;
    }

    //
    // Divider derived section support
    //
    public List<Section> getSections()
    {
        List<Section> result = new ArrayList<Section>();
        // TODO: Sort the array columnwise. In case of the OP's document, the lines already appear in the
        // correct order, so there was no need for sorting in the POC. 

        LineSegment previous = null;
        for (LineSegment line : lines)
        {
            if (previous == null)
            {
                result.add(new Section(null, line));
            }
            else if (Math.abs(previous.getStartPoint().get(Vector.I1) - line.getStartPoint().get(Vector.I1)) < 2) // 2 is a magic number... 
            {
                result.add(new Section(previous, line));
            }
            else
            {
                result.add(new Section(previous, null));
                result.add(new Section(null, line));
            }
            previous = line;
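        }

        // Also add the open-ended section below the last divider of the last
        // column; its bottom is delimited by the bottom margin.
        if (previous != null)
        {
            result.add(new Section(previous, null));
        }

        return result;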
    }

    public class Section implements TextChunkFilter
    {
        LineSegment topLine;
        LineSegment bottomLine;

        final float left, right, top, bottom;

        Section(LineSegment topLine, LineSegment bottomLine)
        {
            float left, right, top, bottom;
            if (topLine != null)
            {
                this.topLine = topLine;
                top = Math.max(topLine.getStartPoint().get(Vector.I2), topLine.getEndPoint().get(Vector.I2));
                right = Math.max(topLine.getStartPoint().get(Vector.I1), topLine.getEndPoint().get(Vector.I1));
                left = Math.min(topLine.getStartPoint().get(Vector.I1), topLine.getEndPoint().get(Vector.I1));
            }
            else
            {
                top = topMargin;
                left = leftMargin;
                right = rightMargin;
            }

            if (bottomLine != null)
            {
                this.bottomLine = bottomLine;
                bottom = Math.min(bottomLine.getStartPoint().get(Vector.I2), bottomLine.getEndPoint().get(Vector.I2));
                right = Math.max(bottomLine.getStartPoint().get(Vector.I1), bottomLine.getEndPoint().get(Vector.I1));
                left = Math.min(bottomLine.getStartPoint().get(Vector.I1), bottomLine.getEndPoint().get(Vector.I1));
            }
            else
            {
                bottom = bottomMargin;
            }

            this.top = top;
            this.bottom = bottom;
            this.left = left;
            this.right = right;
        }

        //
        // TextChunkFilter
        //
        @Override
        public boolean accept(TextChunk textChunk)
        {
            // TODO: This code only checks the text chunk starting point. One should take the 
            // whole chunk into consideration
            Vector startlocation = textChunk.getStartLocation();
            float x = startlocation.get(Vector.I1);
            float y = startlocation.get(Vector.I2);

            return (left <= x) && (x <= right) && (bottom <= y) && (y <= top);
        }
    }

    //
    // ExtRenderListener implementation
    //
    /**
     * <p>
     * This method stores targets of <code>moveTo</code> in {@link #moveToVector}
     * and targets of <code>lineTo</code> in {@link #lineToVector}. Any unexpected
     * contents or operations result in clearing of the member variables.
     * </p>
     * <p>
     * So this method is implemented for files with divider lines exactly like in
     * the OP's sample file.
     * </p>
     *  
     * @see ExtRenderListener#modifyPath(PathConstructionRenderInfo)
     */
    @Override
    public void modifyPath(PathConstructionRenderInfo renderInfo)
    {
        switch (renderInfo.getOperation())
        {
        case PathConstructionRenderInfo.MOVETO:
        {
            float x = renderInfo.getSegmentData().get(0);
            float y = renderInfo.getSegmentData().get(1);
            moveToVector = new Vector(x, y, 1);
            lineToVector = null;
            break;
        }
        case PathConstructionRenderInfo.LINETO:
        {
            float x = renderInfo.getSegmentData().get(0);
            float y = renderInfo.getSegmentData().get(1);
            if (moveToVector != null)
            {
                lineToVector = new Vector(x, y, 1);
            }
            break;
        }
        default:
            moveToVector = null;
            lineToVector = null;
        }
    }

    /**
     * This method adds the current path to {@link #lines} if it consists
     * of a single line, the operation is not a no-op, and the line is
     * approximately horizontal.
     *  
     * @see ExtRenderListener#renderPath(PathPaintingRenderInfo)
     */
    @Override
    public Path renderPath(PathPaintingRenderInfo renderInfo)
    {
        if (moveToVector != null && lineToVector != null &&
            renderInfo.getOperation() != PathPaintingRenderInfo.NO_OP)
        {
            Vector from = moveToVector.cross(renderInfo.getCtm());
            Vector to = lineToVector.cross(renderInfo.getCtm());
            Vector extent = to.subtract(from);

            if (Math.abs(20 * extent.get(Vector.I2)) < Math.abs(extent.get(Vector.I1)))
            {
                LineSegment line;
                if (extent.get(Vector.I1) >= 0)
                    line = new LineSegment(from, to);
                else
                    line = new LineSegment(to, from);
                lines.add(line);
            }
        }

        moveToVector = null;
        lineToVector = null;
        return null;
    }

    /* (non-Javadoc)
     * @see com.itextpdf.text.pdf.parser.ExtRenderListener#clipPath(int)
     */
    @Override
    public void clipPath(int rule)
    {
    }

    //
    // inner members
    //
    final float topMargin, bottomMargin, leftMargin, rightMargin;
    Vector moveToVector = null;
    Vector lineToVector = null;
    final List<LineSegment> lines = new ArrayList<LineSegment>();
}

(DividerAwareTextExtrationStrategy.java)
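Regarding the TODO in accept: one way to take the whole chunk into consideration is to require both the start and the end point of a chunk to lie inside the section rectangle. The following is only a sketch in plain Java with a made-up class; in iText 5.5.x the end point should be available via TextChunk.getEndLocation(), but please verify that against your version.

```java
// Illustrative only: a rectangle test on both ends of a text chunk.
// The class name and its methods are hypothetical, not iText API.
class ChunkAcceptSketch
{
    final float left, right, top, bottom;

    ChunkAcceptSketch(float left, float right, float top, float bottom)
    {
        this.left = left;
        this.right = right;
        this.top = top;
        this.bottom = bottom;
    }

    /** The check used in the POC above, for a single point. */
    boolean contains(float x, float y)
    {
        return (left <= x) && (x <= right) && (bottom <= y) && (y <= top);
    }

    /** Stricter variant: the chunk must start and end inside the rectangle. */
    boolean acceptWhole(float startX, float startY, float endX, float endY)
    {
        return contains(startX, startY) && contains(endX, endY);
    }

    public static void main(String[] args)
    {
        ChunkAcceptSketch section = new ChunkAcceptSketch(56.69f, 305.81f, 772.73f, 690.91f);
        System.out.println(section.acceptWhole(60f, 750f, 300f, 750f));
        System.out.println(section.acceptWhole(60f, 750f, 320f, 750f));
    }
}
```

Whether a chunk sticking out of the rectangle should be rejected, accepted, or split is a design decision; the strict variant shown here simply rejects it.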

The strategy can be used like this:

String extractAndStore(PdfReader reader, String format, int from, int to) throws IOException
{
    StringBuilder builder = new StringBuilder();

    for (int page = from; page <= to; page++)
    {
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        DividerAwareTextExtrationStrategy strategy = parser.processContent(page, new DividerAwareTextExtrationStrategy(810, 30, 20, 575));

        List<Section> sections = strategy.getSections();
        int i = 0;
        for (Section section : sections)
        {
            String sectionText = strategy.getResultantText(section);
            Files.write(Paths.get(String.format(format, page, i)), sectionText.getBytes("UTF8"));

            builder.append("--\n")
                   .append(sectionText)
                   .append('\n');
            i++;
        }
        builder.append("\n\n");
    }

    return builder.toString();
}

(DividerAwareTextExtraction.java method extractAndStore)

Applying this method to pages 319 and 320 of your sample PDF

PdfReader reader = new PdfReader("20150211600.PDF");
String content = extractAndStore(reader, new File(RESULT_FOLDER, "20150211600.%s.%s.txt").toString(), 319, 320);

(DividerAwareTextExtraction.java test test20150211600_320)

results in

--
do(s) bem (ns) exceder o seu crédito, depositará, no prazo de 3 (três) 
dias, a diferença, sob pena de ser tornada sem efeito a arrematação 
[...]
EDITAL DE INTIMAÇÃO DE ADVOGADOS
RELAÇÃO Nº 0041/2015
--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC)
Processo 0033473-16.2010.8.24.0023 (023.10.033473-6) - Ação Penal
Militar - Procedimento Ordinário - Militar - Autor: Ministério Público 
do Estado de Santa Catarina - Réu: João Gabriel Adler - Publicada a 
sentença neste ato, lida às partes e intimados os presentes. Registre-se.
A defesa manifesta o interesse em recorrer da sentença.
--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC), CARLOS ROBERTO PEREIRA (OAB 29179/SC), ROBSON 
LUIZ CERON (OAB 22475/SC)
Processo 0025622-86.2011.8.24.0023 (023.11.025622-3) - Ação
[...]
1, NIVAEL MARTINS PADILHA, Mat. 928313-7, ANDERSON
VOGEL e ANTÔNIO VALDEMAR FORTES, no ato deprecado.


--

--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC)
Processo 0006958-36.2013.8.24.0023 (023.13.006958-5) - Ação Penal
Militar - Procedimento Ordinário - Crimes Militares - Autor: Ministério
Público do Estado de Santa Catarina - Réu: Pedro Conceição Bungarten
- Ficam intimadas as partes, da decisão de fls. 289/290, no prazo de 
05 (cinco) dias.
--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC), ROBSON LUIZ CERON (OAB 22475/SC)
Processo 0006967-95.2013.8.24.0023 (023.13.006967-4) - Ação Penal
[...]
a presença dos réus no ato deprecado.
--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC)
Processo 0016809-02.2013.8.24.0023 - Ação Penal Militar -
[...]
prazo de 05 (cinco) dias.
--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC), ELIAS NOVAIS PEREIRA (OAB 30513/SC), ROBSON LUIZ 
CERON (OAB 22475/SC)
Processo 0021741-33.2013.8.24.0023 - Ação Penal Militar -
[...]
a presença dos réus no ato deprecado.
--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC)
Processo 0024568-17.2013.8.24.0023 - Ação Penal Militar -
[...]
do CPPM
--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC)
Processo 0034522-87.2013.8.24.0023 - Ação Penal Militar -
[...]
diligências, consoante o art. 427 do CPPM
--
ADV: SANDRO MARCELO PEROTTI (OAB 8949/SC), NOEL 
ANTÔNIO BARATIERI (OAB 16462/SC), RODRIGO TADEU 
PIMENTA DE OLIVEIRA (OAB 16752/SC)
Processo 0041634-10.2013.8.24.0023 - Ação Penal Militar -
Procedimento Ordinário - Crimes Militares - Autor: M. P. E. - Réu: J. P. 
D. - Defiro a juntada dos documentos de pp. 3214-3217. Oficie-se com
urgência à Comarca de Porto União (ref. Carta Precatória n. 0000463-
--
15.2015.8.24.0052), informando a habilitação dos procuradores. Intime-
se, inclusive os novos constituídos da designação do ato.
--
ADV: SANDRO MARCELO PEROTTI (OAB 8949/SC), NOEL 
ANTÔNIO BARATIERI (OAB 16462/SC), RODRIGO TADEU 
PIMENTA DE OLIVEIRA (OAB 16752/SC)
Processo 0041634-10.2013.8.24.0023 - Ação Penal Militar -
[...]
imprescindível a presença dos réus no ato deprecado.
--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC)
Processo 0043998-52.2013.8.24.0023 - Ação Penal Militar -
[...]
de parcelas para desconto remuneratório. Intimem-se.
--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC)
Processo 0049304-02.2013.8.24.0023 - Ação Penal Militar -
[...]
Rel. Ângela Maria Silveira).
--
ADV: ROBSON LUIZ CERON (OAB 22475/SC)
Processo 0000421-87.2014.8.24.0023 - Ação Penal Militar -
[...]
prazo de 05 (cinco) dias.
--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC)
Processo 0003198-45.2014.8.24.0023 - Ação Penal Militar -
[...]
de 05 (cinco) dias.
--
ADV: ISAEL MARCELINO COELHO (OAB 13878/SC), ROBSON 
LUIZ CERON (OAB 22475/SC)
Processo 0010380-82.2014.8.24.0023 - Ação Penal Militar -
Procedimento Ordinário - Crimes Militares - Autor: Ministério Público
Estadual - Réu: Vilson Diocimar Antunes - HOMOLOGO o pedido 
de desistência. Intime-se a defesa para o que preceitua o artigo 417, 
§2º, do Código de Processo Penal Militar.

(shortened a bit for obvious reasons)

Divide at colored headers

In a comment the OP wrote:

One little thing more, how can I identify a font size /color change inside section? I need that in some cases where there is no divider (only a bigger Title) (example page 346,"Armazém" should end the section)

As an example I extended the DividerAwareTextExtrationStrategy above to add the ascender lines of text in a given color to the already found divider lines:

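import java.util.Collections;
import java.util.Comparator;
import java.util.List;

import com.itextpdf.text.BaseColor;
import com.itextpdf.text.pdf.CMYKColor;
import com.itextpdf.text.pdf.parser.LineSegment;
import com.itextpdf.text.pdf.parser.TextRenderInfo;
import com.itextpdf.text.pdf.parser.Vector;

public class DividerAndColorAwareTextExtractionStrategy extends DividerAwareTextExtrationStrategy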
{
    //
    // constructor
    //
    public DividerAndColorAwareTextExtractionStrategy(float topMargin, float bottomMargin, float leftMargin, float rightMargin, BaseColor headerColor)
    {
        super(topMargin, bottomMargin, leftMargin, rightMargin);
        this.headerColor = headerColor;
    }

    //
    // DividerAwareTextExtrationStrategy overrides
    //
    /**
     * As the {@link DividerAwareTextExtrationStrategy#lines} are not
     * properly sorted anymore (the additional lines come after all
     * divider lines of the same column), we have to sort that {@link List}
     * first.
     */
    @Override
    public List<Section> getSections()
    {
        Collections.sort(lines, new Comparator<LineSegment>()
        {
            @Override
            public int compare(LineSegment o1, LineSegment o2)
            {
                Vector start1 = o1.getStartPoint();
                Vector start2 = o2.getStartPoint();

                float v1 = start1.get(Vector.I1), v2 = start2.get(Vector.I1);
                if (Math.abs(v1 - v2) < 2)
                {
                    v1 = start2.get(Vector.I2);
                    v2 = start1.get(Vector.I2);
                }

                return Float.compare(v1, v2);
            }
        });

        return super.getSections();
    }

    /**
     * The ascender lines of text rendered using a fill color approximately
     * like the given header color are added to the divider lines.
     */
    @Override
    public void renderText(TextRenderInfo renderInfo)
    {
        if (approximates(renderInfo.getFillColor(), headerColor))
        {
            lines.add(renderInfo.getAscentLine());
        }

        super.renderText(renderInfo);
    }

    /**
     * This method checks whether two colors are approximately equal. As the
     * sample document only uses CMYK colors, only this comparison has been
     * implemented yet.
     */
    boolean approximates(BaseColor colorA, BaseColor colorB)
    {
        if (colorA == null || colorB == null)
            return colorA == colorB;
        if (colorA instanceof CMYKColor && colorB instanceof CMYKColor)
        {
            CMYKColor cmykA = (CMYKColor) colorA;
            CMYKColor cmykB = (CMYKColor) colorB;
            float c = Math.abs(cmykA.getCyan() - cmykB.getCyan());
            float m = Math.abs(cmykA.getMagenta() - cmykB.getMagenta());
            float y = Math.abs(cmykA.getYellow() - cmykB.getYellow());
            float k = Math.abs(cmykA.getBlack() - cmykB.getBlack());
            return c+m+y+k < 0.01;
        }
        // TODO: Implement comparison for other color types
        return false;
    }

    final BaseColor headerColor;
}

(DividerAndColorAwareTextExtractionStrategy.java)

In renderText we recognize texts in the headerColor and add their respective top line to the lines list.

Beware: we add the ascender line of each chunk in the given color. We actually should join the ascender lines of all text chunks forming a single header line. As the blue header lines in the sample document consist of merely a single chunk, we don't need to in this sample code. A generic solution would have to be appropriately extended.
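A generic extension along those lines could join neighbouring ascender lines before adding them as dividers. The following is only a sketch in plain Java, with segments represented as {xStart, xEnd, y} arrays instead of iText LineSegment objects; the class name, the method and both tolerances are assumptions, and the input is assumed to arrive in reading order.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: merges ascender lines of chunks on the same text line.
// Segments are float arrays { xStart, xEnd, y }; nothing here is iText API.
class AscentLineJoiner
{
    static List<float[]> join(List<float[]> segments, float yTolerance, float maxGap)
    {
        List<float[]> result = new ArrayList<float[]>();
        float[] current = null;
        for (float[] segment : segments)
        {
            boolean sameLine = current != null
                    && Math.abs(current[2] - segment[2]) <= yTolerance
                    && segment[0] - current[1] <= maxGap;
            if (sameLine)
            {
                // extend the line collected so far to the right
                current[1] = Math.max(current[1], segment[1]);
            }
            else
            {
                // start a new joined line; clone to leave the input untouched
                current = segment.clone();
                result.add(current);
            }
        }
        return result;
    }

    public static void main(String[] args)
    {
        List<float[]> input = new ArrayList<float[]>();
        input.add(new float[] { 56f, 100f, 700f });
        input.add(new float[] { 102f, 180f, 700f });
        input.add(new float[] { 56f, 90f, 650f });
        for (float[] line : join(input, 1f, 5f))
            System.out.println(line[0] + " .. " + line[1] + " at y=" + line[2]);
    }
}
```

With such a step in place, a heading consisting of several chunks would contribute one long line instead of several short ones.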

As the lines are not properly sorted anymore (the additional ascender lines come after all divider lines of the same column), we have to sort that list first.

Please be aware that the Comparator used here is not strictly correct: it treats x coordinates differing by less than 2 as equal, which makes the ordering non-transitive. It only works if the individual lines of the same column have approximately the same starting x coordinate, differing clearly from those of different columns.
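The transitivity issue can be reproduced with three plain numbers. The sketch below is plain Java without iText; the class name is made up, and the bucketed comparator with its width parameter is merely one conceivable alternative, not something the code above uses.

```java
import java.util.Comparator;

// Illustrative only: shows why "equal within a tolerance" is not transitive.
class ToleranceComparatorDemo
{
    /** Treats values closer than 2 as equal, like the x comparison above. */
    static final Comparator<Float> TOLERANT = new Comparator<Float>()
    {
        @Override
        public int compare(Float a, Float b)
        {
            return Math.abs(a - b) < 2 ? 0 : Float.compare(a, b);
        }
    };

    /** Hypothetical alternative: snap values to buckets of the given width first. */
    static Comparator<Float> bucketed(final float width)
    {
        return new Comparator<Float>()
        {
            @Override
            public int compare(Float a, Float b)
            {
                return Integer.compare((int) Math.floor(a / width), (int) Math.floor(b / width));
            }
        };
    }

    public static void main(String[] args)
    {
        // 0 "equals" 1.5 and 1.5 "equals" 3, but 0 does not "equal" 3
        System.out.println(TOLERANT.compare(0f, 1.5f));
        System.out.println(TOLERANT.compare(1.5f, 3f));
        System.out.println(TOLERANT.compare(0f, 3f));
    }
}
```

Bucketing gives a consistent ordering but has its own weakness: two nearly identical values can fall on different sides of a bucket boundary, so the width has to fit the column layout of the document.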

In a test run (cf. DividerAndColorAwareTextExtraction.java method test20150211600_346) the found sections are also split at the blue headings "Armazém" and "Balneário Camboriú".

Please be aware of the restrictions I pointed out above. If e.g. you want to split at the grey headings in your sample document, you'll have to improve the methods above as those headings don't come in a single chunk.

answered Oct 05 '22 22:10 by mkl