
How to extract the contents of a table in a PDF file? [duplicate]

I want to extract the contents of a table in a PDF like this:

enter image description here

I wrote this Java program using the iText PDF library. It can read the contents of a PDF file line by line, but I do not know how to get the contents of the table.

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

public class PDFReader {

    public static void main(String[] args) {
        System.out.println("Lecteur PDF");
        System.out.println(ReadPDF("D:/test.pdf"));
    }

    // Returns the text of every page, concatenated in page order.
    private static String ReadPDF(String pdf_url) {
        StringBuilder str = new StringBuilder();
        try {
            PdfReader reader = new PdfReader(pdf_url);
            int n = reader.getNumberOfPages();
            // Pages are numbered from 1, so the loop has to include page n.
            for (int i = 1; i <= n; i++) {
                String str2 = PdfTextExtractor.getTextFromPage(reader, i);
                str.append(str2);
            }
            reader.close();
        } catch (Exception err) {
            err.printStackTrace();
        }
        return str.toString();
    }
}

This is what I get:

enter image description here

But that's not what I want. I want to extract the contents of the table row by row and column by column, for example by saving each row in a Java array.

The first array will contain: "N°", "DATE OBSERVATIONS", "TEXTE"

The second array will contain: "029/14", "Le 1er sept 2014 remplace AVURNAV...", "SETE A compter du lundi 7 juillet 2014 débuteront les trav..."

The third array will contain: "037/14", "Le 15 octobre 2014 remplace AVURNAV ...", "SETE Du 15 septembre 2014 au 15 juillet 2015, travaux ...."

And so on.

Thanks

Bertrand asked Jul 09 '15 22:07


1 Answer

If your PDF library doesn't support extracting tables, you may have to identify the character sequences that commonly begin or end each field and use them to split your data into an array. For instance, the first field is nnn/nn, the second field ends with nnnn/nn, and the third field ends where the next first field begins.
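
A minimal sketch of that splitting idea, assuming the extracted text really follows the layout described above (a row id of the form nnn/nn, a second field ending in nnnn/nn, and a third field running up to the next row id). The class name RowSplitter and the method splitRows are made up for illustration, and the pattern would likely need tuning against the real PDF text. The header row does not match the pattern and is skipped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RowSplitter {

    // Assumed row layout, taken from the description above, not from the real file.
    private static final Pattern ROW = Pattern.compile(
          "(?<!\\d)(\\d{3}/\\d{2})(?!\\d)\\s+"      // field 1: row id, nnn/nn
        + "(.*?(?<!\\d)\\d{4}/\\d{2})(?!\\d)\\s+"   // field 2: ends with nnnn/nn
        + "(.*?)"                                    // field 3: free text
        + "(?=\\s+\\d{3}/\\d{2}(?!\\d)|\\s*\\z)",   // stop before the next row id or at the end
        Pattern.DOTALL);

    /** Splits extracted page text into one String[3] per table row. */
    static List<String[]> splitRows(String text) {
        List<String[]> rows = new ArrayList<>();
        Matcher m = ROW.matcher(text);
        while (m.find()) {
            String[] row = new String[3];
            for (int i = 0; i < 3; i++) {
                // Collapse line breaks left over from wrapped cells.
                row[i] = m.group(i + 1).replaceAll("\\s+", " ").trim();
            }
            rows.add(row);
        }
        return rows;
    }
}
```

Passing the string returned by ReadPDF to RowSplitter.splitRows would then give one array per row, provided the text extractor emits the three fields of a row in that order, which is not guaranteed for every PDF.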

This is a tricky problem. I have had to use coordinate-based approaches to deal with this before, but your PDF library may not support extracting the position of letters as well as the actual text.
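
A rough sketch of what a coordinate-based approach could look like, assuming you can already obtain each piece of text together with the x/y position where it starts. How you get those positions depends on the library (iText 5's parser package appears to offer a RenderListener callback that receives a TextRenderInfo, but check that against your version). The Chunk type, the TableGrid class, the column start positions and the tolerance are all hypothetical values chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class TableGrid {

    /** Hypothetical holder for one piece of text and where its baseline starts. */
    static final class Chunk {
        final String text;
        final float x;
        final float y;
        Chunk(String text, float x, float y) { this.text = text; this.x = x; this.y = y; }
    }

    /** Groups chunks into text lines by y, then into columns by x. */
    static List<String[]> toRows(List<Chunk> chunks, float[] columnStarts, float yTolerance) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        // PDF y coordinates grow upward, so a larger y is higher on the page.
        sorted.sort(Comparator.comparingDouble((Chunk c) -> -c.y));

        List<String[]> rows = new ArrayList<>();
        List<Chunk> line = new ArrayList<>();
        for (Chunk c : sorted) {
            // A jump in y larger than the tolerance starts a new text line.
            if (!line.isEmpty() && Math.abs(line.get(0).y - c.y) > yTolerance) {
                addLine(rows, line, columnStarts);
                line = new ArrayList<>();
            }
            line.add(c);
        }
        if (!line.isEmpty()) addLine(rows, line, columnStarts);
        return rows;
    }

    private static void addLine(List<String[]> rows, List<Chunk> line, float[] columnStarts) {
        line.sort(Comparator.comparingDouble((Chunk c) -> c.x));
        String[] cells = new String[columnStarts.length];
        Arrays.fill(cells, "");
        for (Chunk c : line) {
            int col = columnOf(c.x, columnStarts);
            cells[col] = cells[col].isEmpty() ? c.text : cells[col] + " " + c.text;
        }
        // Assumption: a line with an empty first column is a wrapped continuation
        // of the previous table row, so its cells are appended to that row.
        if (cells[0].isEmpty() && !rows.isEmpty()) {
            String[] prev = rows.get(rows.size() - 1);
            for (int i = 0; i < cells.length; i++) {
                if (!cells[i].isEmpty()) {
                    prev[i] = prev[i].isEmpty() ? cells[i] : prev[i] + " " + cells[i];
                }
            }
        } else {
            rows.add(cells);
        }
    }

    /** Index of the last column whose start position is at or left of x. */
    private static int columnOf(float x, float[] columnStarts) {
        int col = 0;
        for (int i = 0; i < columnStarts.length; i++) {
            if (x >= columnStarts[i]) col = i;
        }
        return col;
    }
}
```

The column start positions would have to be measured from the actual page, for example from the x positions of the header cells. The continuation rule only works when every table row has text in its first column, as the rows quoted in the question do.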

3-14159265358979323846264 answered Oct 19 '22 19:10