
Extract text content from PDF

I have been extracting text from PDFs using pdftotext, and I have also done this with Ghostscript. Recently, a utility provider changed their PDFs so that a portion of the text is no longer extracted by either method. Specifically, I'm missing the due date and the total due. When I open the PDF in a reader, the 'missing' text can be highlighted, copied, and pasted into an external editor. When I open it in Acrobat Pro and view the content (View -> Show/Hide -> Navigation Panes -> Content), the text I need is there. How can I get it out without manually copying and pasting? That is not an option, because I'll be doing this on thousands of PDFs.

Here is an example of what I'm dealing with. I have removed all sensitive data:

link to PDF

EDIT: I noticed after posting this that when you follow the link to the file (hosted on Google Drive), the viewer lets you select and copy most of the text on the page, but not the text I'm missing. When you download the file, you can select the missing text in a PDF reader.

Ben Walker asked Jun 16 '26 20:06



1 Answer

Recent releases of Ghostscript have a txtwrite device which is probably worth trying.
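
As a rough sketch of how the txtwrite device could be driven over a large batch of files, the Python script below wraps the Ghostscript command line. It assumes the Ghostscript executable is installed and named `gs` (on Windows it is usually `gswin64c`), and the function names and the `bills` directory are hypothetical, chosen only for illustration. Whether txtwrite actually recovers the due date and total due from this particular PDF is untested here.

```python
import shutil
import subprocess
from pathlib import Path


def build_txtwrite_command(pdf_path, txt_path, gs_binary="gs"):
    """Build the Ghostscript argument list for text extraction via txtwrite."""
    return [
        gs_binary,
        "-q",                        # suppress startup banner
        "-dNOPAUSE",                 # do not prompt between pages
        "-dBATCH",                   # exit once the input is processed
        "-dSAFER",                   # restrict file access from PostScript
        "-sDEVICE=txtwrite",         # the text output device from the answer
        f"-sOutputFile={txt_path}",  # where the extracted text is written
        str(pdf_path),
    ]


def extract_text(pdf_path, txt_path, gs_binary="gs"):
    """Run Ghostscript on one PDF and return the extracted text."""
    if shutil.which(gs_binary) is None:
        raise FileNotFoundError(f"Ghostscript executable not found: {gs_binary}")
    subprocess.run(build_txtwrite_command(pdf_path, txt_path, gs_binary), check=True)
    return Path(txt_path).read_text(encoding="utf-8", errors="replace")


if __name__ == "__main__":
    # "bills" is a hypothetical directory holding the PDFs to process.
    for pdf in sorted(Path("bills").glob("*.pdf")):
        text = extract_text(pdf, pdf.with_suffix(".txt"))
        print(pdf.name, len(text), "characters extracted")
```

Keeping the command construction in its own function makes it easy to print and inspect the exact invocation before running it against thousands of files.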

chrisl answered Jun 20 '26 15:06



