I am building an OCR project and I am using a .NET wrapper for Tesseract. The samples that ship with the wrapper don't show how to deal with a PDF as input. Given a PDF as input, how do I produce a searchable PDF using C#?
How can I extract text from a PDF while preserving the layout of the original PDF?
This is a page from the PDF. I don't want only the text; I want the text to be laid out as it is in the original PDF. Sorry for my poor English.
Just for documentation reasons, here is an example of OCR using tesseract and pdf2image to extract text from an image PDF:
    import pdf2image
    try:
        from PIL import Image
    except ImportError:
        import Image
    import pytesseract

    def pdf_to_img(pdf_file):
        return pdf2image.convert_from_path(pdf_file)

    def ocr_core(file):
        text = pytesseract.image_to_string(file)
        return text

    def print_pages(pdf_file):
        images = pdf_to_img(pdf_file)
        for pg, img in enumerate(images):
            print(ocr_core(img))

    print_pages('sample.pdf')
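The snippet above only prints plain text. If the goal is a searchable PDF, as in the question, a similar approach might look like the sketch below. It is in Python rather than C#, and it assumes pdf2image (with poppler) and pytesseract (with the Tesseract binary) are installed. The function names `page_output_name` and `make_searchable_pages` and the `_page_NNN` naming scheme are made up for illustration. It writes one searchable PDF per page and does not merge them into a single file.

```python
import os


def page_output_name(pdf_file, page_number):
    """Build a per-page output name, e.g. 'sample.pdf' -> 'sample_page_001.pdf'.

    The naming scheme is an illustrative assumption, not a convention of any library.
    """
    stem, _ = os.path.splitext(os.path.basename(pdf_file))
    return f"{stem}_page_{page_number:03d}.pdf"


def make_searchable_pages(pdf_file, out_dir="."):
    """OCR each page of an image PDF and write one searchable PDF per page."""
    # Imported here so page_output_name can be used without the OCR stack installed.
    import pdf2image
    import pytesseract

    written = []
    # Render every page of the input PDF to a PIL image.
    images = pdf2image.convert_from_path(pdf_file)
    for number, image in enumerate(images, start=1):
        # Returns the bytes of a one-page PDF: the page image plus a hidden text layer.
        pdf_bytes = pytesseract.image_to_pdf_or_hocr(image, extension="pdf")
        target = os.path.join(out_dir, page_output_name(pdf_file, number))
        with open(target, "wb") as handle:
            handle.write(pdf_bytes)
        written.append(target)
    return written
```

Calling `make_searchable_pages('sample.pdf')` would then write `sample_page_001.pdf`, `sample_page_002.pdf`, and so on into the current directory. Because each output page is the original page image with the recognized text placed invisibly on top of it, the result looks like the original while the text can be searched and selected, which may also cover the layout question above.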