How to read pdf file to a text file in a proper format using Spire.PDF or any other library?

Question

How can I read PDF files and save their contents to a text file using Spire.PDF? For example, here is a PDF file, and here is the desired text file produced from that PDF.

I tried the code below to read the file and save its contents to a text file:

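using System.IO;
using System.Text;
using Spire.Pdf;

// Load the source PDF.
PdfDocument doc = new PdfDocument();
doc.LoadFromFile(@"C:\Users\Tamal\Desktop\101395a.pdf");

StringBuilder buffer = new StringBuilder();

// Append the extracted text of every page to one buffer.
foreach (PdfPageBase page in doc.Pages)
{
    buffer.Append(page.ExtractText());
}

doc.Close();

// Write the text to disk and open it in the default viewer.
String fileName = @"C:\Users\Tamal\Desktop\101395a.txt";
File.WriteAllText(fileName, buffer.ToString());
System.Diagnostics.Process.Start(fileName);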

But the output text file is not properly formatted. It contains unnecessary whitespace, and complete paragraphs are broken across multiple lines.

How do I get output that matches the desired text file?

Additionally, is it possible to detect and mark (for example, by adding a tag) text that is bold, italic, or underlined? Things also get more problematic for pages that have multiple columns of text.

Joris Schellekens · Accepted Answer

Using iText

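import java.io.File;
import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor;
import com.itextpdf.kernel.pdf.canvas.parser.listener.SimpleTextExtractionStrategy;

File inputFile = new File("input.pdf");
PdfDocument pdfDocument = new PdfDocument(new PdfReader(inputFile));

// The strategy collects text while the processor walks the page content.
SimpleTextExtractionStrategy stes = new SimpleTextExtractionStrategy();
PdfCanvasProcessor canvasProcessor = new PdfCanvasProcessor(stes);
canvasProcessor.processPageContent(pdfDocument.getPage(1)); // first page only

System.out.println(stes.getResultantText());
pdfDocument.close();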

As the class name suggests, this is a basic text extraction strategy, and the snippet above processes only page 1. More advanced examples can be found in the documentation.
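On the bold/italic part of the question: PDF content usually does not carry a "bold" or "italic" flag on the text itself. The style is often only visible in the font name, for example Helvetica-Bold or Times-BoldItalic. Assuming your extraction code can report the font name for each chunk of text, a rough heuristic is to tag the chunk based on that name. The class and method below are hypothetical helpers written for illustration, not part of iText or any other library, and the name matching will misfire on fonts that do not follow this naming habit. Underlines are typically drawn as separate line graphics, so this approach does not detect them.

```java
import java.util.Locale;

// Hypothetical helper for illustration; not part of any PDF library.
final class FontStyleTagger {

    /**
     * Wraps text in <b> and/or <i> tags when the font name hints at a style.
     * This is only a naming heuristic: it checks for "bold", "italic" and
     * "oblique" anywhere in the font name, ignoring case.
     */
    static String tag(String text, String fontName) {
        String name = fontName == null ? "" : fontName.toLowerCase(Locale.ROOT);
        boolean bold = name.contains("bold");
        boolean italic = name.contains("italic") || name.contains("oblique");

        String result = text;
        if (italic) {
            result = "<i>" + result + "</i>";
        }
        if (bold) {
            result = "<b>" + result + "</b>";
        }
        return result;
    }

    public static void main(String[] args) {
        // Font names here are sample inputs, not taken from a real file.
        System.out.println(tag("Important", "Helvetica-Bold"));
        System.out.println(tag("note", "ABCDEF+Times-BoldItalic"));
        System.out.println(tag("plain", "Helvetica"));
    }
}
```

The tags are plain strings, so you can swap them for whatever markup your text file should use.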

Krunal Soni · Answer

Use IronOCR

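using System;
using System.IO;

var Ocr = new IronOcr.AutoOcr();
// Verbatim strings (@"...") keep the backslashes in Windows paths from being read as escape sequences.
var Results = Ocr.ReadPdf(@"E:\Demo.pdf");
File.WriteAllText(@"E:\Demo.txt", Convert.ToString(Results));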

For reference: https://ironsoftware.com/csharp/ocr/

With this you should get formatted text output, though not exactly the output you want.
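Whichever library produces the raw text, part of the gap can be closed with a post-processing pass over the extracted string. The sketch below is in Java and uses only the standard library. It collapses runs of whitespace and joins consecutive lines into one paragraph, treating a blank line as the paragraph separator. That last point is an assumption: if the extracted text has no blank lines between paragraphs, everything will be merged into a single paragraph and you would need a different rule. The class name ExtractedTextCleaner and the method reflow are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; works on any extracted text string.
final class ExtractedTextCleaner {

    /**
     * Collapses whitespace inside each line and joins consecutive non-blank
     * lines into one paragraph. Blank lines are treated as paragraph breaks.
     */
    static String reflow(String raw) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = new StringBuilder();

        // \R matches any line break, including \r\n as a single unit.
        for (String line : raw.split("\\R", -1)) {
            String cleaned = line.trim().replaceAll("\\s+", " ");
            if (cleaned.isEmpty()) {
                // A blank line closes the paragraph collected so far.
                if (current.length() > 0) {
                    paragraphs.add(current.toString());
                    current.setLength(0);
                }
            } else {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(cleaned);
            }
        }
        if (current.length() > 0) {
            paragraphs.add(current.toString());
        }
        // Separate paragraphs with exactly one empty line.
        return String.join("\n\n", paragraphs);
    }

    public static void main(String[] args) {
        String raw = "The  quick   brown\nfox jumps.\n\n  Second   paragraph\ncontinues here. ";
        System.out.println(reflow(raw));
    }
}
```

This does not fix multi-column pages. If the extraction step interleaves columns, the lines arrive in the wrong order and no amount of whitespace cleanup will restore them.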

If you need exactly that output, you should look at paid OCR SDKs such as the OmniPage Capture SDK and the ABBYY FineReader SDK.

Tags:

c#

pdf

ocr

Tamal Banerjee

2 Answers

Joris Schellekens

Krunal Soni
