Extract text content from PDF

Question

I have been extracting text from PDFs using pdftotext, and I have also done this with Ghostscript. Recently, a utility provider changed their PDFs so that a portion of the text is no longer extracted by either method. Specifically, I'm missing the due date and total due. When I open the PDF in a reader, the 'missing' text can be highlighted, copied, and pasted into an external editor. When I open it in Acrobat Pro and view the content (View -> Show/Hide -> Navigation Panes -> Content), the text I need is there. How can I get it out without manually copying and pasting? That is not an option, because I'll be doing this on thousands of PDFs.

Here is an example of what I'm dealing with. I have removed all sensitive data:

link to PDF

EDIT: I noticed after posting this that when you follow the link to the file (hosted on Google Drive), the preview lets you select and copy most of the text on the page, but not the text I'm missing. When you download the file, you can select the missing text in a PDF reader.

chrisl · Accepted Answer

Recent releases of Ghostscript have a txtwrite device which is probably worth trying.
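
As a minimal sketch of how the txtwrite device could be driven across many files, the Python wrapper below shells out to Ghostscript once per PDF. It assumes the executable is on the PATH under the name `gs` (on Windows it is typically `gswin64c`), and the function names and file paths are hypothetical, chosen only for illustration. Whether txtwrite actually recovers the missing due date and total due on these particular PDFs is something to check on a sample file first.

```python
import shutil
import subprocess
from pathlib import Path


def build_txtwrite_command(pdf_path, txt_path, gs_binary="gs"):
    """Return the Ghostscript argument list for a txtwrite extraction.

    gs_binary is an assumption: "gs" on Linux/macOS, usually "gswin64c" on Windows.
    """
    return [
        gs_binary,
        "-q",                         # suppress startup banner
        "-dNOPAUSE",                  # do not pause between pages
        "-dBATCH",                    # exit when the input is processed
        "-sDEVICE=txtwrite",          # the text extraction device
        f"-sOutputFile={txt_path}",   # where the extracted text goes
        str(pdf_path),
    ]


def extract_text(pdf_path, txt_path, gs_binary="gs"):
    """Run Ghostscript on one PDF and return the extracted text."""
    if shutil.which(gs_binary) is None:
        raise FileNotFoundError(f"Ghostscript executable not found: {gs_binary}")
    subprocess.run(build_txtwrite_command(pdf_path, txt_path, gs_binary), check=True)
    # Decode leniently in case the output contains unexpected bytes.
    return Path(txt_path).read_text(encoding="utf-8", errors="replace")


def extract_directory(pdf_dir, gs_binary="gs"):
    """Extract every *.pdf in pdf_dir to a .txt file with the same stem."""
    results = {}
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        txt_path = pdf_path.with_suffix(".txt")
        results[pdf_path.name] = extract_text(pdf_path, txt_path, gs_binary)
    return results
```

Calling `extract_directory("bills")` on a folder of PDFs (the folder name is a placeholder) would leave one `.txt` file beside each PDF and return the text keyed by file name, which can then be searched for the due date and total due.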

Tags: pdf, ghostscript, pdftotext

Ben Walker
