
How to read a PDF file into a properly formatted text file using Spire.PDF or any other library?

Tags:

c#

pdf

ocr

How can I read PDF files and save their contents to a text file using Spire.PDF? For example: here is a PDF file, and here is the desired text file from that PDF.

I tried the code below to read the file and save it to a text file:

PdfDocument doc = new PdfDocument();
doc.LoadFromFile(@"C:\Users\Tamal\Desktop\101395a.pdf");

StringBuilder buffer = new StringBuilder();

foreach (PdfPageBase page in doc.Pages)
{
    buffer.Append(page.ExtractText());
}

doc.Close();
String fileName = @"C:\Users\Tamal\Desktop\101395a.txt";
File.WriteAllText(fileName, buffer.ToString());
System.Diagnostics.Process.Start(fileName);

But the output text file is not properly formatted. It contains unnecessary whitespace, and complete paragraphs are broken into multiple lines.

How do I get output that matches the desired text file?

Additionally, is it possible to detect and mark (for example, by adding a tag) text that is bold, italic, or underlined? Things also get more problematic for pages that have multiple columns of text.

Tamal Banerjee asked May 26 '18 02:05



2 Answers

Using iText (the example is Java)

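import java.io.File;
import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor;
import com.itextpdf.kernel.pdf.canvas.parser.listener.SimpleTextExtractionStrategy;

File inputFile = new File("input.pdf");
PdfDocument pdfDocument = new PdfDocument(new PdfReader(inputFile));

// Collect the text of page 1 (pages are 1-indexed) in content-stream order.
SimpleTextExtractionStrategy stes = new SimpleTextExtractionStrategy();
PdfCanvasProcessor canvasProcessor = new PdfCanvasProcessor(stes);
canvasProcessor.processPageContent(pdfDocument.getPage(1));

System.out.println(stes.getResultantText());
pdfDocument.close();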

As the class name says, this is a basic/simple text extraction strategy, and the snippet reads only the first page. More advanced examples can be found in the iText documentation.

Joris Schellekens answered Sep 28 '22 04:09



Use IronOCR

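var Ocr = new IronOcr.AutoOcr();
// Verbatim strings (@"...") stop the backslashes in Windows paths from being read as escape sequences.
var Results = Ocr.ReadPdf(@"E:\Demo.pdf");
File.WriteAllText(@"E:\Demo.txt", Convert.ToString(Results));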

For reference: https://ironsoftware.com/csharp/ocr/

With this you should get formatted text output, though not exactly the output you want.

If you need exact, pre-interpreted output, you should look at paid OCR products such as the OmniPage Capture SDK and the ABBYY FineReader SDK.
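If what remains wrong is mostly the stray whitespace and hard-wrapped paragraphs described in the question, a post-processing pass over the extracted string may get closer to the desired file, whichever library produced it. Below is a minimal sketch using only the .NET base class library. ExtractedTextCleaner and Normalize are hypothetical names, not part of Spire.PDF, iText, or IronOCR, and the sketch assumes that a blank line in the extracted text marks a paragraph break, which will not hold for every PDF.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any PDF or OCR library.
public static class ExtractedTextCleaner
{
    // Rejoins hard-wrapped lines into paragraphs and collapses runs of whitespace.
    // Assumption: a blank line in the raw text separates two paragraphs.
    public static string Normalize(string raw)
    {
        var paragraphs = new List<string>();
        var current = new StringBuilder();

        foreach (string rawLine in raw.Replace("\r\n", "\n").Split('\n'))
        {
            // Collapse whitespace inside the line and trim both ends.
            string line = Regex.Replace(rawLine, @"\s+", " ").Trim();

            if (line.Length == 0)
            {
                // Blank line: close the paragraph collected so far, if any.
                if (current.Length > 0)
                {
                    paragraphs.Add(current.ToString());
                    current.Clear();
                }
                continue;
            }

            // Join wrapped lines of the same paragraph with a single space.
            if (current.Length > 0)
            {
                current.Append(' ');
            }
            current.Append(line);
        }

        if (current.Length > 0)
        {
            paragraphs.Add(current.ToString());
        }

        // One blank line between paragraphs in the output.
        return string.Join(Environment.NewLine + Environment.NewLine, paragraphs);
    }
}
```

With the code from the question, this would be called as File.WriteAllText(fileName, ExtractedTextCleaner.Normalize(buffer.ToString()));. It does not detect bold, italic, or underlined text and does not untangle multi-column pages; those need the font and position information that only the extraction library can provide.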

Krunal Soni answered Sep 28 '22 04:09
