
Extract text from a PDF (I have a link to the PDF) in Ruby

Tags: ruby, pdf

I have a link like

      http://www.downloads.com/help.pdf

I want to download this, and parse it to get the text content.

How do I go about this? I also plan to tag-ize (if there is such a word) the extracted text.

theReverseFlick asked Feb 05 '11 05:02





2 Answers

You can either use the pdf-reader gem (the example/text.rb example is simple and worked for me): https://github.com/yob/pdf-reader

Or the command-line utility pdftotext.
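Both options can be sketched roughly as follows. This is only an illustration, not code from either project: the helper names (with_downloaded_pdf, text_via_pdf_reader, pdftotext_command, text_via_pdftotext) are made up for this example, it assumes the pdf-reader gem is installed for the first option and that a pdftotext binary (from poppler-utils or Xpdf) is on the PATH for the second, and it does no retry or redirect handling beyond what open-uri provides.

```ruby
require 'open-uri'
require 'open3'
require 'tempfile'

# Download the PDF at +url+ into a temporary file and yield its path.
# The file is removed when the block returns.
def with_downloaded_pdf(url)
  Tempfile.create(['download', '.pdf']) do |file|
    file.binmode
    URI.open(url) { |remote| IO.copy_stream(remote, file) }
    file.flush
    yield file.path
  end
end

# Option 1: the pdf-reader gem (gem install pdf-reader).
def text_via_pdf_reader(pdf_path)
  require 'pdf-reader' # loaded lazily so option 2 works without the gem
  PDF::Reader.new(pdf_path).pages.map(&:text).join("\n")
end

# Command line for option 2; the trailing "-" makes pdftotext write to stdout.
def pdftotext_command(pdf_path)
  ['pdftotext', pdf_path, '-']
end

# Option 2: shell out to the pdftotext command-line utility.
def text_via_pdftotext(pdf_path)
  output, status = Open3.capture2(*pdftotext_command(pdf_path))
  raise "pdftotext failed for #{pdf_path}" unless status.success?

  output
end
```

With the link from the question, a call would look something like with_downloaded_pdf('http://www.downloads.com/help.pdf') { |path| text_via_pdftotext(path) }, or the same with text_via_pdf_reader. Either way the result is a plain Ruby String.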

seeingidog answered Sep 28 '22 19:09



The Yomu gem can also extract the text from a PDF (as well as from other MIME types) for you.

      require 'yomu'

      # file_path is assumed to be the path to the PDF on disk
      text = Yomu.new(file_path).text
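Whichever extractor produces the string, the "tag-ize" step mentioned in the question could start out as a simple word-frequency count. The sketch below is only one naive way to do it: tags_from and STOP_WORDS are hypothetical names, the stop-word list is a placeholder, and the regular expression only handles plain ASCII words.

```ruby
# Hypothetical stop-word list; extend it for real documents.
STOP_WORDS = %w[a an and are for in is of on or the to with].freeze

# Return the +limit+ most frequent words in +text+ as candidate tags.
# Ties are broken alphabetically so the result is deterministic.
def tags_from(text, limit: 5)
  words = text.downcase.scan(/[a-z][a-z'-]+/) - STOP_WORDS
  words.tally
       .sort_by { |word, count| [-count, word] }
       .first(limit)
       .map(&:first)
end
```

For example, tags_from(Yomu.new(file_path).text, limit: 10) would return up to ten candidate tags for the document. Enumerable#tally needs Ruby 2.7 or newer.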
diasks2 answered Sep 28 '22 20:09
