Tesseract OCR: PDF as input

Question

I am building an OCR project using a .NET wrapper for Tesseract. The wrapper's samples don't show how to handle a PDF as input. Given a PDF as input, how do I produce a searchable PDF using C#?

I have used the Ghostscript library to convert the PDF to images and then fed those images to Tesseract. That works well for getting the text, but it doesn't preserve the layout of the original PDF; I only get plain text.

How can I extract the text from a PDF while preserving the layout of the original?

[Image: a page from the PDF]

This is a page from the PDF. I don't want only the text; I want the text laid out in the same shapes as in the original PDF. Sorry for my poor English.

Kostas Charitidis · Accepted Answer

For the record, here is a Python example of OCR using Tesseract (through pytesseract) and pdf2image to extract text from an image-based PDF.

    import pdf2image
    try:
        from PIL import Image
    except ImportError:
        import Image
    import pytesseract


    def pdf_to_img(pdf_file):
        return pdf2image.convert_from_path(pdf_file)


    def ocr_core(file):
        text = pytesseract.image_to_string(file)
        return text


    def print_pages(pdf_file):
        images = pdf_to_img(pdf_file)
        for pg, img in enumerate(images):
            print(ocr_core(img))


    print_pages('sample.pdf')
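Note that the example above only returns plain text, so the page layout is lost. If the goal is a searchable PDF, one possible variation is to ask Tesseract for PDF output instead of a string. The sketch below is a rough illustration rather than a tested solution: it assumes pytesseract's `image_to_pdf_or_hocr` with `extension="pdf"`, which returns the bytes of a one-page PDF containing the page image plus an invisible text layer. The names `pdf_to_searchable_pages`, `_render_pages`, `_ocr_to_pdf` and the `page_001.pdf` naming scheme are made up for this example. It writes one PDF per page; merging them into a single file would need an extra PDF library and is left out.

```python
from pathlib import Path


def _render_pages(pdf_file):
    # Imported lazily so the module still loads where pdf2image is missing.
    import pdf2image
    return pdf2image.convert_from_path(pdf_file)


def _ocr_to_pdf(image):
    # Assumption: Tesseract and pytesseract are installed.
    import pytesseract
    # Returns bytes of a one-page PDF (page image + hidden text layer).
    return pytesseract.image_to_pdf_or_hocr(image, extension="pdf")


def pdf_to_searchable_pages(pdf_file, out_dir, render=_render_pages, ocr=_ocr_to_pdf):
    """Write one searchable PDF per page of pdf_file into out_dir.

    render and ocr are parameters (hypothetical hooks for this sketch)
    so a different rasterizer or OCR backend can be swapped in.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for number, image in enumerate(render(pdf_file), start=1):
        target = out_dir / f"page_{number:03d}.pdf"
        target.write_bytes(ocr(image))
        written.append(target)
    return written
```

Calling `pdf_to_searchable_pages('sample.pdf', 'searchable_pages')` should then leave `page_001.pdf`, `page_002.pdf`, and so on in the `searchable_pages` folder, each looking like the original scan but with selectable text.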

