I have a PDF file that contains data we need to import into a database. The files appear to be PDF scans of printed alphanumeric text, in what looks like 10 pt Times New Roman. Are there any tools or components that will allow me to recognize and parse this text?

Programmatically recognize text from scans in a PDF File [closed]

5 Answers

I've used pdftohtml to successfully strip tables out of PDFs into CSV. It's based on Xpdf, a more general-purpose toolkit that also includes pdftotext. I just wrap it in a Process.Start call from C#.
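A minimal sketch of what such a Process.Start wrapper might look like, shown here for pdftotext rather than pdftohtml. It assumes the pdftotext executable is on the PATH; the class and method names are hypothetical, and the -layout flag is optional. Note that this only returns text from PDFs that already have a text layer, not from scanned images.

```csharp
using System;
using System.Diagnostics;

static class PdfToTextRunner
{
    // Builds: -layout "<input.pdf>" -
    // The trailing "-" asks pdftotext to write the text to stdout.
    public static string BuildArguments(string pdfPath)
    {
        return "-layout \"" + pdfPath + "\" -";
    }

    public static string ExtractText(string pdfPath)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = "pdftotext",            // assumption: pdftotext is on the PATH
            Arguments = BuildArguments(pdfPath),
            UseShellExecute = false,           // required for stream redirection
            RedirectStandardOutput = true,
            CreateNoWindow = true
        };

        using (var process = Process.Start(startInfo))
        {
            // Read all output before waiting, so a full pipe cannot block the child.
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            if (process.ExitCode != 0)
                throw new InvalidOperationException(
                    "pdftotext exited with code " + process.ExitCode);
            return text;
        }
    }
}
```

Calling PdfToTextRunner.ExtractText(@"C:\docs\report.pdf") (a made-up path) would return the extracted text as a single string, which can then be split into lines and parsed.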

If you're looking for something a little more DIY, there's the iTextSharp library - a port of Java's iText - and PDFBox (yes, it says Java - but they have a .NET version by way of IKVM.NET). Here are some CodeProject articles on using iTextSharp and PDFBox from C#.

And, if you're really a masochist, you could call into Adobe's PDF IFilter with COM interop. The IFilter spec is pretty simple, but I would guess that the interop overhead would be significant.

Edit: After re-reading the question and subsequent answers, it's become clear that the OP is dealing with images in his PDF. In that case, you'll need to extract the images (the PDF libraries above are able to do that fairly easily) and run them through an OCR engine.
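One possible way to do the extraction step is the pdfimages tool that ships with Xpdf, wrapped the same way as the other command-line tools. This is only a sketch: it assumes pdfimages is on the PATH, the class and method names and the "page" file prefix are hypothetical, and the output format (typically PBM or PPM) depends on the tool and its options.

```csharp
using System;
using System.Diagnostics;
using System.IO;

static class PdfImageExtractor
{
    // pdfimages usage is: pdfimages [options] <PDF-file> <image-root>
    public static string BuildArguments(string pdfPath, string imageRoot)
    {
        return "\"" + pdfPath + "\" \"" + imageRoot + "\"";
    }

    // Writes every embedded image to outputDir and returns the file paths.
    public static string[] ExtractImages(string pdfPath, string outputDir)
    {
        Directory.CreateDirectory(outputDir);
        string imageRoot = Path.Combine(outputDir, "page"); // hypothetical prefix

        var startInfo = new ProcessStartInfo
        {
            FileName = "pdfimages",            // assumption: pdfimages is on the PATH
            Arguments = BuildArguments(pdfPath, imageRoot),
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (var process = Process.Start(startInfo))
        {
            process.WaitForExit();
            if (process.ExitCode != 0)
                throw new InvalidOperationException(
                    "pdfimages exited with code " + process.ExitCode);
        }

        // Output files are named <image-root>-<zero-padded number>.<ext>,
        // so an ordinal sort keeps them in extraction order.
        string[] files = Directory.GetFiles(outputDir, "page-*");
        Array.Sort(files, StringComparer.Ordinal);
        return files;
    }
}
```

Each returned file can then be handed to whichever OCR engine you settle on.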

I've used MODI interactively before, with decent results. It's COM, so calling it from C# via interop is also doable and pretty simple:

' lifted from http://en.wikipedia.org/wiki/Microsoft_Office_Document_Imaging
Dim inputFile As String = "C:\test\multipage.tif"
Dim strRecText As String = ""
Dim Doc1 As MODI.Document

Doc1 = New MODI.Document
Doc1.Create(inputFile)
Doc1.OCR()  ' this will ocr all pages of a multi-page tiff file
Doc1.Save() ' this will save the deskewed reoriented images, and the OCR text, back to the inputFile

For imageCounter As Integer = 0 To (Doc1.Images.Count - 1) ' work your way through each page of results
   strRecText &= Doc1.Images(imageCounter).Layout.Text    ' this puts the ocr results into a string
Next

System.IO.File.AppendAllText("C:\test\testmodi.txt", strRecText) ' write the OCR text out to disk

Doc1.Close() ' clean up
Doc1 = Nothing

Others like Tesseract, but I have no direct experience with it. I've heard both good and bad things about it, so I imagine it greatly depends on your source quality.

answered Oct 04 '22 04:10

Mark Brackett

You can't extract scanned text from a PDF directly, because a scan is stored as an image rather than as text. You need OCR software. The good news is that there are a few open source applications you can try, and the OCR route will most likely be easier than using a PDF library to extract text. Check out Tesseract and GOCR.
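As a rough sketch of the OCR route, the Tesseract command-line tool can be driven from C# once a page is available as an image file. The example assumes the tesseract executable and the requested language data are installed and on the PATH; the class and method names are hypothetical.

```csharp
using System;
using System.Diagnostics;
using System.IO;

static class TesseractRunner
{
    // tesseract usage is: tesseract <image> <outputbase> [-l <lang>]
    // Tesseract appends ".txt" to the output base name itself.
    public static string BuildArguments(string imagePath, string outputBase, string language)
    {
        return "\"" + imagePath + "\" \"" + outputBase + "\" -l " + language;
    }

    public static string Recognize(string imagePath, string language)
    {
        // Use a unique temporary base name so parallel runs do not collide.
        string outputBase = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N"));
        string outputFile = outputBase + ".txt";

        var startInfo = new ProcessStartInfo
        {
            FileName = "tesseract",            // assumption: tesseract is on the PATH
            Arguments = BuildArguments(imagePath, outputBase, language),
            UseShellExecute = false,
            CreateNoWindow = true
        };

        try
        {
            using (var process = Process.Start(startInfo))
            {
                process.WaitForExit();
                if (process.ExitCode != 0)
                    throw new InvalidOperationException(
                        "tesseract exited with code " + process.ExitCode);
            }
            return File.ReadAllText(outputFile);
        }
        finally
        {
            // Clean up the temporary result file whether or not OCR succeeded.
            if (File.Exists(outputFile))
                File.Delete(outputFile);
        }
    }
}
```

For example, TesseractRunner.Recognize(@"C:\scans\page-000.ppm", "eng") (a made-up path) would return the recognized text for one page. Recognition quality will depend heavily on scan resolution and skew.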

answered Oct 04 '22 02:10

jm4

I posted about parsing PDFs on one of my blogs. Follow this link:

http://devpinoy.org/blogs/marl/archive/2008/03/04/pdf-to-text-using-open-source-library-pdfbox-another-sample-for-grade-1-pupils.aspx

Edit: The link no longer works. The text below is quoted from http://web.archive.org/web/20130507084207/http://devpinoy.org/blogs/marl/archive/2008/03/04/pdf-to-text-using-open-source-library-pdfbox-another-sample-for-grade-1-pupils.aspx

Well, the following is based on popular examples available on the web. What this does is "read" the PDF file and output it as text in the rich text box control on the form. The PDFBox for .NET library can be downloaded from SourceForge.

You need to add references to IKVM.GNU.Classpath and PDFBox-0.7.3. Also, FontBox-0.1.0-dev.dll and PDFBox-0.7.3.dll need to be added to the bin folder of your application. For some reason I can't recall (maybe it's from one of the tutorials), I also added IKVM.GNU.Classpath.dll to the bin folder.

On a side note, I just got my copy of "Head First C#" (on Keith's suggestion) from Amazon. The book is cool! It is really written for beginners. This edition covers VS2008 and .NET Framework 3.5.

Here you go...

/* Marlon Ribunal
 * Convert PDF To Text
 * *******************/

using System;
using System.Collections.Generic;
using System.Drawing;
using System.Windows.Forms;
using System.Drawing.Printing;
using System.IO;
using System.Text;
using System.ComponentModel.Design;
using System.ComponentModel;
using org.pdfbox.pdmodel;
using org.pdfbox.util;

namespace MarlonRibunal.iPdfToText
{
    public partial class MainForm : Form
    {
        public MainForm()
        {
            InitializeComponent(); 
        }

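        void Button1Click(object sender, EventArgs e)
        {
            // Load the PDF, extract its text layer, and show it in the rich text box.
            PDDocument doc = PDDocument.load("C:\\pdftoText\\myPdfTest.pdf");
            try
            {
                PDFTextStripper stripper = new PDFTextStripper();
                richTextBox1.Text = stripper.getText(doc);
            }
            finally
            {
                doc.close(); // release the underlying file
            }
        }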

     }
}

answered Oct 04 '22 02:10

MarlonRibunal

At a company I used to work for, we used ActivePDF toolkit with some success:

http://www.activepdf.com/products/serverproducts/toolkit/index.cfm

I think you'd need at least the Standard or Pro version but they have trials so you can see if it'll do what you want it to.

answered Oct 04 '22 02:10

Dana

A quick Google search turns up this promising result: http://www.pdftron.com/net/index.html

answered Oct 04 '22 02:10

Sijin


Tags: pdf, ocr

Rob
