
Extract Data from .PDF files

I need to extract data from .PDF files and load it into SQL Server 2008. Can anyone tell me how to proceed?

S.. asked Jan 24 '11 16:01



1 Answer

Here is an example of how to use iTextSharp to extract text from a PDF. You'll have to adjust it to make it do exactly what you want, but I think it's a good outline. The StringBuilder collects the extracted text; you could easily change that to write to SQL instead.

    using System.Text;
    using iTextSharp.text.pdf;
    using iTextSharp.text.pdf.parser;

    class Program
    {
        static void Main(string[] args)
        {
            PdfReader reader = new PdfReader(@"c:\test.pdf");
            StringBuilder builder = new StringBuilder();

            // iTextSharp page numbers are 1-based.
            for (int x = 1; x <= reader.NumberOfPages; x++)
            {
                // The listener receives every text chunk found on the page.
                IRenderListener listener = new SBTextRenderer(builder);
                PdfContentStreamProcessor processor = new PdfContentStreamProcessor(listener);
                PdfDictionary pageDic = reader.GetPageN(x);
                PdfDictionary resourcesDic = pageDic.GetAsDict(PdfName.RESOURCES);
                processor.ProcessContent(ContentByteUtils.GetContentBytesForPage(reader, x), resourcesDic);
            }

            reader.Close();
            // builder now holds the text of every page.
        }
    }

    // Appends each rendered text chunk to the supplied StringBuilder.
    public class SBTextRenderer : IRenderListener
    {
        private StringBuilder _builder;

        public SBTextRenderer(StringBuilder builder)
        {
            _builder = builder;
        }

        #region IRenderListener Members

        public void BeginTextBlock()
        {
        }

        public void EndTextBlock()
        {
        }

        public void RenderImage(ImageRenderInfo renderInfo)
        {
        }

        public void RenderText(TextRenderInfo renderInfo)
        {
            _builder.Append(renderInfo.GetText());
        }

        #endregion
    }
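As for the "change that to use SQL" part, here is a minimal sketch of one way to store the extracted text with a parameterized insert. None of this comes from the question: the `PdfText` table, its columns, and the `PdfTextStore` class are made up for illustration, so adapt them to your own schema. It is written against the generic `System.Data` interfaces, so any ADO.NET connection should work.

```csharp
using System;
using System.Data;

// Hypothetical helper, not part of iTextSharp.
// Assumes a table created roughly like this (an assumption, adjust as needed):
//   CREATE TABLE PdfText (FileName NVARCHAR(260), PageNumber INT, Content NVARCHAR(MAX))
public static class PdfTextStore
{
    public const string InsertSql =
        "INSERT INTO PdfText (FileName, PageNumber, Content) " +
        "VALUES (@fileName, @pageNumber, @content)";

    // Inserts the text of one page and returns the number of rows affected.
    // The connection must already be open.
    public static int SavePage(IDbConnection connection, string fileName, int pageNumber, string content)
    {
        using (IDbCommand command = connection.CreateCommand())
        {
            command.CommandText = InsertSql;
            AddParameter(command, "@fileName", fileName);
            AddParameter(command, "@pageNumber", pageNumber);
            AddParameter(command, "@content", content);
            return command.ExecuteNonQuery();
        }
    }

    // Parameters keep the PDF text out of the SQL string itself.
    private static void AddParameter(IDbCommand command, string name, object value)
    {
        IDbDataParameter parameter = command.CreateParameter();
        parameter.ParameterName = name;
        parameter.Value = value ?? DBNull.Value;
        command.Parameters.Add(parameter);
    }
}
```

To use it, you would open a `SqlConnection` (from `System.Data.SqlClient`) with your own connection string, create a fresh StringBuilder for each page inside the loop, and call something like `PdfTextStore.SavePage(connection, @"c:\test.pdf", x, builder.ToString())` after `ProcessContent`. Storing one row per page is just one choice; a single row per file would work the same way.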
Daniel Ahrnsbrak answered Sep 17 '22 17:09