
PDFBox header version info error

I used PDFBox to parse a PDF document. It throws an exception saying it cannot find the header version info. Any ideas?

I think the version is 1.3; I saw it when I cast every byte to a char. The link is http://www.selab.isti.cnr.it/ws-mate/example.pdf

Here is the code of the method and its output:

public String PDFtest(String textLink) throws IOException {
    PDFParser parser;
    String parsedText = null;
    PDFTextStripper pdfStripper;
    PDDocument pdDoc;
    COSDocument cosDoc;
    PDDocumentInformation pdDocInfo;

    StringBuilder sd = new StringBuilder();
    URL link;
    try {
        link = new URL(textLink);
        URLConnection urlConn = link.openConnection();
        BufferedInputStream in = null;
        in = new BufferedInputStream(urlConn.getInputStream());
        byte data[] = new byte[1024];
        in.read(data, 0, 1024);

        parser = new PDFParser(in);
        parser.parse();
        cosDoc = parser.getDocument();
        pdfStripper = new PDFTextStripper();
        pdDoc = new PDDocument(cosDoc);
        parsedText = pdfStripper.getText(pdDoc);
    } catch (MalformedURLException ex) {
        Logger.getLogger(HTMLhelper.class.getName()).log(Level.SEVERE, null, ex);
    } catch (NumberFormatException e) {
        System.out.println("hata");
    }

    return parsedText;
}

Exception:

Exception in thread "main" java.io.IOException: Error: Header doesn't contain versioninfo
    at org.apache.pdfbox.pdfparser.PDFParser.parseHeader(PDFParser.java:317)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:173)
    at ParsingMachine.HTMLhelper.PDFtest(HTMLhelper.java:99)
    at ParsingMachine.tester.main(tester.java:18)
Java Result: 1
user2638084 asked Sep 25 '13 19:09


2 Answers

You must be merging a file that is not in PDF format. Please check carefully whether the list contains any file other than a PDF.
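
One way to do that check in code is to look at the first bytes of the input before handing it to the parser. A PDF normally begins with the marker `%PDF-` followed by the version. The snippet below is only a minimal sketch, not PDFBox API: the class name `PdfHeaderCheck` and the method `looksLikePdf` are made up for illustration, and it assumes a strict check at the very first byte of a stream that supports mark/reset.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of PDFBox.
final class PdfHeaderCheck {

    // The marker a PDF is expected to start with.
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Peeks at the first bytes of the stream and rewinds it afterwards,
    // so a parser can still read the stream from its first byte.
    static boolean looksLikePdf(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(PDF_MAGIC.length);
        byte[] head = new byte[PDF_MAGIC.length];
        int total = 0;
        while (total < head.length) {
            int n = in.read(head, total, head.length - total);
            if (n < 0) {
                break; // stream ended before the marker was complete
            }
            total += n;
        }
        in.reset(); // rewind to the marked position
        return total == head.length && Arrays.equals(head, PDF_MAGIC);
    }

    public static void main(String[] args) throws IOException {
        byte[] pdfBytes = "%PDF-1.3\n...".getBytes(StandardCharsets.US_ASCII);
        byte[] otherBytes = "<html>not a pdf</html>".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikePdf(new BufferedInputStream(new ByteArrayInputStream(pdfBytes))));
        System.out.println(looksLikePdf(new BufferedInputStream(new ByteArrayInputStream(otherBytes))));
    }
}
```

The check only works if it sees the stream from its first byte. If some bytes were already read from the stream and not rewound, the `%PDF-` marker is no longer at the start, and a parser reading from that point would not find it either.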

asraniinfo answered Nov 04 '22 02:11


In my case, I was iterating through the files in a directory.
Windows can create a hidden Thumbs.db file in a directory.
That file was being picked up along with the PDFs and was breaking the PDF processing.
Applying a filter to pick only PDF files (*.pdf) fixed it.
Cheers.
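
A filter like that could look roughly like this. It is a sketch using the standard `java.io.File.listFiles(FilenameFilter)` call; the class name `PdfFileFilter`, the method `listPdfFiles`, and the choice to match the extension case-insensitively are assumptions for illustration, not the exact code I used.

```java
import java.io.File;
import java.io.FilenameFilter;
import java.util.Locale;

// Hypothetical helper for illustration.
final class PdfFileFilter {

    // Accepts only names ending in ".pdf", ignoring case, so files such as
    // Thumbs.db are skipped.
    static final FilenameFilter PDF_ONLY =
            (dir, name) -> name.toLowerCase(Locale.ROOT).endsWith(".pdf");

    // Returns the PDF files directly inside the directory, or an empty
    // array if the path is not a readable directory.
    static File[] listPdfFiles(File directory) {
        File[] files = directory.listFiles(PDF_ONLY);
        return files != null ? files : new File[0];
    }

    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : ".");
        for (File pdf : listPdfFiles(dir)) {
            System.out.println(pdf.getPath());
        }
    }
}
```

This only looks at the file name, so a non-PDF file that happens to be named something.pdf would still get through.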

murphy1310 answered Nov 04 '22 04:11