I want to extract the contents of a table in a PDF like this:
I wrote this Java program using the iText Java PDF library. It can read the contents of a PDF file line by line, but I do not know how to get the contents of the table:
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

public class PDFReader {
    public static void main(String[] args) {
        System.out.println("Lecteur PDF");
        System.out.println(ReadPDF("D:/test.pdf"));
    }

    // Returns the text of every page of the PDF, one page after another.
    private static String ReadPDF(String pdf_url) {
        StringBuilder str = new StringBuilder();
        try {
            PdfReader reader = new PdfReader(pdf_url);
            int n = reader.getNumberOfPages();
            // Pages are numbered from 1, so the last page is page n (i <= n).
            for (int i = 1; i <= n; i++) {
                String str2 = PdfTextExtractor.getTextFromPage(reader, i);
                // Separate pages with a newline so their text does not run together.
                str.append(str2).append('\n');
            }
            reader.close();
        } catch (Exception err) {
            err.printStackTrace();
        }
        return str.toString();
    }
}
This is what I get:
But that's not what I want. I want to extract the contents of the table row by row and column by column, for example by saving each row in a Java array:
The first array will contain: "N°", "DATE OBSERVATIONS", "TEXTE"
The second array will contain: "029/14", "Le 1er sept 2014 remplace AVURNAV...", "SETE A compter du lundi 7 juillet 2014 débuteront les trav..."
The third array will contain: "037/14", "Le 15 octobre 2014 remplace AVURNAV ...", "SETE Du 15 septembre 2014 au 15 juillet 2015, travaux ...."
And so on.
Thanks
You may have to identify common field beginning/end character sequences to split your data into an array if your PDF library doesn't support extracting tables.
For instance, the first field is nnn/nn, the second field ends with nnnn/nn, and the third field ends where the next first field begins.
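As a rough sketch of that idea, the extracted text could be split with a regular expression built from those three rules. The class name `TableRowSplitter`, the method `splitRows`, and the sample text are made up for illustration. The pattern also assumes that nothing shaped like nnn/nn appears inside the third column, which may not hold for your real data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TableRowSplitter {
    // Field 1: nnn/nn.
    // Field 2: everything up to and including the first nnnn/nn.
    // Field 3: everything up to the next nnn/nn, or the end of the text.
    private static final Pattern ROW = Pattern.compile(
        "\\b(\\d{3}/\\d{2})\\b\\s+(.*?\\b\\d{4}/\\d{2}\\b)\\s+(.*?)"
            + "(?=\\s+\\b\\d{3}/\\d{2}\\b|\\z)",
        Pattern.DOTALL);

    // Splits the raw page text into one String[3] per table row.
    static List<String[]> splitRows(String text) {
        List<String[]> rows = new ArrayList<>();
        Matcher m = ROW.matcher(text);
        while (m.find()) {
            rows.add(new String[] { m.group(1), clean(m.group(2)), clean(m.group(3)) });
        }
        return rows;
    }

    // Collapses line breaks and repeated spaces inside a cell.
    private static String clean(String s) {
        return s.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Invented sample text, shaped like the output of PdfTextExtractor.
        String text = "No DATE OBSERVATIONS TEXTE\n"
            + "029/14 Le 1er sept 2014 remplace AVURNAV 0123/14 SETE sample text one\n"
            + "037/14 Le 15 octobre 2014 remplace AVURNAV 0456/14 SETE sample text\ntwo\n";
        for (String[] row : splitRows(text)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

The header line is skipped because it does not start with nnn/nn, so it would need to be added as the first array separately.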
This is a tricky problem - I have had to use coordinate based approaches to deal with this before, but your pdf library may not support extracting the position of letters as well as the actual text.
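To illustrate the coordinate-based approach, here is a minimal sketch that assumes you already have each piece of text together with its x/y position. As far as I know, iText 5's parser package can supply these through a custom `RenderListener` and `TextRenderInfo.getBaseline()`, but that part is not shown here. The `Chunk` class, the column start positions, and the tolerance are all hypothetical values chosen for the example, not something read from your PDF.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class CoordinateTableSketch {
    // A piece of text and the position where it starts (PDF units, y grows upward).
    static final class Chunk {
        final float x, y;
        final String text;
        Chunk(float x, float y, String text) { this.x = x; this.y = y; this.text = text; }
    }

    // Groups chunks into text lines by y, assigns each chunk to a column by x,
    // and merges lines whose first column is empty into the previous row.
    static List<String[]> toRows(List<Chunk> chunks, float[] columnStarts, float yTolerance) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> -c.y)); // top of page first

        List<List<Chunk>> lines = new ArrayList<>();
        float lineY = 0f;
        for (Chunk c : sorted) {
            if (lines.isEmpty() || Math.abs(c.y - lineY) > yTolerance) {
                lines.add(new ArrayList<>());
                lineY = c.y;
            }
            lines.get(lines.size() - 1).add(c);
        }

        List<String[]> rows = new ArrayList<>();
        for (List<Chunk> line : lines) {
            line.sort(Comparator.comparingDouble(c -> c.x)); // left to right
            String[] cells = new String[columnStarts.length];
            Arrays.fill(cells, "");
            for (Chunk c : line) {
                int col = columnIndex(c.x, columnStarts);
                cells[col] = join(cells[col], c.text);
            }
            if (cells[0].isEmpty() && !rows.isEmpty()) {
                // Continuation line of a multi-line cell: append to the previous row.
                String[] prev = rows.get(rows.size() - 1);
                for (int i = 0; i < cells.length; i++) prev[i] = join(prev[i], cells[i]);
            } else {
                rows.add(cells);
            }
        }
        return rows;
    }

    // Index of the last column whose start is at or left of x (0 if none).
    private static int columnIndex(float x, float[] columnStarts) {
        int col = 0;
        for (int i = 0; i < columnStarts.length; i++) {
            if (x >= columnStarts[i]) col = i;
        }
        return col;
    }

    private static String join(String a, String b) {
        if (a.isEmpty()) return b;
        if (b.isEmpty()) return a;
        return a + " " + b;
    }
}
```

The merge step relies on the first column (the nnn/nn number) being filled only on the first line of each table row, which matches the table described in the question but would need checking against the actual PDF.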