I have a link like
http://www.downloads.com/help.pdf
I want to download this, and parse it to get the text content.
How do I go about this? I also plan to tag-ize (if there is such a word) the extracted text.
In Python, the PyPDF2 module can do this: the getPage() method returns a page of the PDF file by its index, which starts from 0, and the extractText() method then extracts the text from that page. (In PyPDF2 3.0 and later these are spelled reader.pages[i] and extract_text().)
You can use either the pdf-reader gem (its example/text.rb example is simple and worked for me): https://github.com/yob/pdf-reader
or the command-line utility pdftotext.
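Both options could look roughly like the sketch below. The method names (join_page_texts, text_via_pdf_reader, pdftotext_command, text_via_pdftotext) are made up for illustration; it assumes the pdf-reader gem is installed for the first option and that pdftotext is on the PATH for the second.

```ruby
require 'open3'

# Join the text of every page. `reader` is any object whose #pages
# returns items responding to #text, which is what PDF::Reader provides.
def join_page_texts(reader)
  reader.pages.map(&:text).join("\n\n")
end

# Option 1: the pdf-reader gem (gem install pdf-reader).
def text_via_pdf_reader(pdf_path)
  require 'pdf-reader' # required lazily so option 2 works without the gem
  join_page_texts(PDF::Reader.new(pdf_path))
end

# Build the pdftotext argument list; "-" as the output file means stdout.
def pdftotext_command(pdf_path)
  ['pdftotext', pdf_path, '-']
end

# Option 2: the pdftotext command-line utility, which must be on the PATH.
def text_via_pdftotext(pdf_path)
  stdout, stderr, status = Open3.capture3(*pdftotext_command(pdf_path))
  raise "pdftotext failed: #{stderr.strip}" unless status.success?

  stdout
end
```

Either method takes the path of a PDF already on disk, for example text_via_pdftotext('help.pdf'), and returns the text as one string.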
The Yomu gem will also be able to extract the text from a PDF (as well as other MIME types) for you.
require 'yomu'
Yomu.new(file_path).text
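The download step from the question needs nothing beyond the standard library. Here is a minimal sketch, assuming the URL serves the PDF directly (redirects are not followed); the names local_filename, pdf_data? and download_pdf are hypothetical helpers, not part of any gem.

```ruby
require 'net/http'
require 'uri'

# Derive a local file name from the URL path, falling back to a
# default when the path does not end in ".pdf".
def local_filename(url)
  name = File.basename(URI.parse(url).path.to_s)
  name.end_with?('.pdf') ? name : 'download.pdf'
end

# PDF files begin with the magic bytes "%PDF-".
def pdf_data?(bytes)
  bytes.start_with?('%PDF-')
end

# Fetch the URL and write the body into dir; returns the saved path.
# Net::HTTP.get_response does not follow redirects.
def download_pdf(url, dir = '.')
  response = Net::HTTP.get_response(URI.parse(url))
  raise "Download failed: HTTP #{response.code}" unless response.is_a?(Net::HTTPSuccess)
  raise 'Response does not look like a PDF' unless pdf_data?(response.body)

  path = File.join(dir, local_filename(url))
  File.binwrite(path, response.body) # binary write keeps the bytes intact
  path
end
```

The returned path can then be handed to whichever extractor you prefer, for example Yomu.new(download_pdf(url)).text.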