Extract Images and Words with coordinates and sizes from PDF

Question

I've read a lot about PDF extraction and libraries (such as iText), but I still haven't found a way to extract images and text, together with their coordinates, from a PDF.

The task is to scan a PDF containing a product catalog and extract each image. An image code is printed next to each image, along with a list of product codes for the products shown in that image.

I know there is no way to extract structured information from a PDF like this, but with the coordinates of all image and text objects I could write code that links text to an image by its distance from that image. Then I could split the text with a regular expression and work out which part is a product code, which is an image code, and so on.
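The distance-based linking described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Box` class, the `max_gap` threshold, and both code patterns (`IMG-` followed by digits, two letters followed by four digits) are hypothetical and would have to be adapted to the real catalog.

```python
import math
import re
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned bounding box in page units (hypothetical helper type)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    label: str = ""  # image name, or the text of a word


def gap(a, b):
    """Shortest distance between two rectangles (0.0 if they touch or overlap)."""
    dx = max(a.x_min - b.x_max, b.x_min - a.x_max, 0.0)
    dy = max(a.y_min - b.y_max, b.y_min - a.y_max, 0.0)
    return math.hypot(dx, dy)


def link_words_to_images(images, words, max_gap=50.0):
    """Attach each word to its nearest image, ignoring words farther than max_gap."""
    linked = {img.label: [] for img in images}
    for word in words:
        nearest = min(images, key=lambda img: gap(img, word), default=None)
        if nearest is not None and gap(nearest, word) <= max_gap:
            linked[nearest.label].append(word.label)
    return linked


# Assumed code formats; replace with the patterns used in the actual catalog.
IMAGE_CODE = re.compile(r"^IMG-\d+$")
PRODUCT_CODE = re.compile(r"^[A-Z]{2}\d{4}$")


def classify(text):
    """Label a piece of text as an image code, a product code, or other."""
    if IMAGE_CODE.match(text):
        return "image_code"
    if PRODUCT_CODE.match(text):
        return "product_code"
    return "other"
```

The rectangle-to-rectangle gap is used instead of the distance between centres so that a long caption directly under a wide image still counts as close to it.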

Could you recommend a good and working solution for the task?

Balamurugan Muthiah · Accepted Answer

Use Xpdf (http://www.foolabs.com/xpdf/).

It can extract the text in the PDF together with its coordinates (pdftotext -bbox [sourcefile] [outputfile]), and it can also extract the images (with the bundled pdfimages tool) and SVGs in the PDF.
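Once pdftotext -bbox has written its output file, the coordinates can be read back with a few lines of code. The sketch below assumes the output is XHTML containing `<page>` elements with nested `<word>` elements that carry `xMin`, `yMin`, `xMax` and `yMax` attributes, which is what the Poppler build of pdftotext produces; other builds may use a different layout, so check a sample file first. The function name `parse_bbox_xhtml` is made up for this example.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://www.w3.org/1999/xhtml}'."""
    return tag.rsplit("}", 1)[-1]


def parse_bbox_xhtml(xhtml_text):
    """Return one dict per <word> with its page number, text, and bounding box."""
    root = ET.fromstring(xhtml_text)
    words = []
    page_no = 0
    # iter() walks the tree in document order, so each <page> is seen
    # before the <word> elements it contains.
    for elem in root.iter():
        name = local_name(elem.tag)
        if name == "page":
            page_no += 1
        elif name == "word":
            words.append({
                "page": page_no,
                "text": elem.text or "",
                "x_min": float(elem.get("xMin")),
                "y_min": float(elem.get("yMin")),
                "x_max": float(elem.get("xMax")),
                "y_max": float(elem.get("yMax")),
            })
    return words
```

For example, after running the command above, read the output file into a string and pass it to `parse_bbox_xhtml` to get a list of words that can then be matched against image positions.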

It's open source (GPLv2) and includes several other extraction tools as well.

Tags: image, pdf, words, extraction, coordinates

Alex
