I have multiple PDF documents in a folder that have a certain structure:
Now I want to be able to parse the information from the PDF. Please note that the paragraphs have varying lengths.
Obviously I am not asking you to solve the problem for me, but I do need some pointers as to how this can be achieved.
I have used Nokogiri before, and essentially I need something like that, but for PDFs.
So the pseudo result for my example would look like this:
- ItemA
  - Title: ItemA
  - File: 123456789.pdf
  - Image: ImageA.png (the image was stored on disk)
  - Subtitle1: Content for subtitle 1
  - Subtitle2: Content for subtitle 2
  - Subtitle3: Content for subtitle 3
- TitleB
  - [...]
The text can easily be parsed like so:
# gem install pdf-reader
require 'pdf-reader'

reader = PDF::Reader.new('my.pdf')
reader.pages.each do |page|
  puts page.text
end
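Once you have the page text, turning it into the Title/Subtitle structure from the question is plain string processing. Here is a rough sketch of one way to do it. It assumes the first non-blank line of the text is the title and that every subtitle heading sits alone on its own line, which may not hold for your PDFs. `parse_item` and the heading names are made up for illustration; they are not part of pdf-reader.

```ruby
# Hypothetical helper (not part of pdf-reader): groups the lines of one
# page's text under a list of known headings.
# Assumptions: the first non-blank line is the title, and each heading
# appears alone on its own line.
def parse_item(text, headings)
  lines = text.lines.map(&:strip).reject(&:empty?)
  item = { 'Title' => lines.shift }
  current = nil

  lines.each do |line|
    if headings.include?(line)
      # A new section starts; paragraphs of any length are collected below it.
      current = line
      item[current] = +''
    elsif current
      item[current] << ' ' unless item[current].empty?
      item[current] << line
    end
  end

  item
end

# Stand-in for page.text, so the sketch runs without a PDF.
sample = <<~TEXT
  ItemA

  Subtitle1
  Content for
  subtitle 1
  Subtitle2
  Content for subtitle 2
TEXT

item = parse_item(sample, %w[Subtitle1 Subtitle2 Subtitle3])
p item
```

With a real document you would pass `page.text` instead of `sample`. Because the content under a heading is simply accumulated until the next heading shows up, paragraphs of varying length are not a problem.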
Images can be extracted with the same library. See the example script examples/extract_images.rb in the pdf-reader repository.
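To give an idea of what such a script has to do, here is a simplified sketch that handles JPEG images only. It assumes it is given the hash returned by `page.xobjects` (name => stream), that each stream exposes its PDF dictionary via `hash` and its raw bytes via `data`, and that JPEG images are marked with the `DCTDecode` filter. The helper names `jpeg_images` and `save_jpegs` are hypothetical; the real example script also covers other image formats.

```ruby
# Hypothetical helper: selects the JPEG-encoded image streams from a page's
# XObjects. `xobjects` is assumed to be a hash of name => stream, where
# stream.hash is the PDF dictionary of that stream.
def jpeg_images(xobjects)
  xobjects.select do |_name, stream|
    dict = stream.hash
    # :Filter may be a single name or an array of names.
    dict[:Subtype] == :Image && Array(dict[:Filter]).include?(:DCTDecode)
  end
end

# Hypothetical helper: writes the raw bytes of each JPEG stream to disk and
# returns the paths that were written. This relies on the assumption that
# the bytes of a DCTDecode stream are already a complete JPEG file.
def save_jpegs(xobjects, prefix, dir = '.')
  jpeg_images(xobjects).map do |name, stream|
    path = File.join(dir, "#{prefix}-#{name}.jpg")
    File.binwrite(path, stream.data)
    path
  end
end
```

Inside the `reader.pages.each` loop this could be called as `save_jpegs(page.xobjects, page.number)`. Images nested inside form XObjects, and images using other filters, are ignored by this sketch.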
This is (not yet) a complete answer. The next steps would now be to:
pdf-reader is one solution, but it has issues: sometimes it doesn't return the text in the proper format. I have used it.
I would suggest using docsplit instead. You will find more information about pdf-reader and docsplit in this blog post.
Hope this helps. In case any clarification is required, feel free to comment.