Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Read specific value based on label name from PDF in C#

I have an asp.net Core 2.0 C# application which read/parse the PDF file and get the text. In this I want to read specific value which have specific label name. You can see the below image I want to get the value 171857 which is Invoice number and store it in database. enter image description here

I have tried below code to read the pdf using iTextSharp.

using (PdfReader reader = new PdfReader(fileName))
        {
            StringBuilder sb = new StringBuilder();

            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
            for (int page = 0; page < reader.NumberOfPages; page++)
            {
                string text = PdfTextExtractor.GetTextFromPage(reader, page + 1, strategy);
                if (!string.IsNullOrWhiteSpace(text))
                {
                    sb.Append(Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(text))));
                }
            }

            var pdfText = sb.ToString();
        }

In pdfText variable I will get all text content from pdf but It seems that this is not the proper way to get the Invoice number. Is there any other way to read the specific content from pdf by it's label name like we will provide label name Invoice and it will return the value 171857 as example with other 3rd party pdf reader libraries?

Any help or suggestions would be highly appreciated.

Thanks

like image 297
prog1011 Avatar asked Dec 22 '22 23:12

prog1011


2 Answers

I have helped a friend extracting similar value from pdf invoice generated by Excel arc. I have for this answer created an Excel invoice and print it as PDF file and zipped for download for testing purpose.

The next thing I do, I am using an Open Source and Free Library called PDFClown. Here is the nuget package for it.

So far so good, what I did is I scan all pdf document (for example invoice can be one page or multiple pages) add each content to a list of string.

The next step I find the index (the invoice number index could be in 10th element in list, in our case it is index 1) that refer to invoice value which I will call Tag or Label.

Hence I do not have your pdf file, I improvised and added a unique Tag called (or any other name) "INVOICE". The invoice number in this case comes after invoice tag tag. So I find the index of "INVOICE" tag and add 1 to index this is because the invoice number follow the invoice tag. This way I will pick the invoice text 0005 in this case and return it as value 5. This way you can fetch what every text/value followed by any tag scanned in our list and return it the way that you need.

So you need to play with it a bit to fit it 100% to your pdf file.

So here is my test files Excel and Pdf zipped down. Download it for your test.

Here is the code:

public class InvoiceTextExtraction
{
    private List<string> _contentList;

    public void GetValueFromPdf()
    {
        _contentList = new List<string>();
        CreatePdfContent(@"C:\temp\Invoice1.pdf");

        var index = _contentList.FindIndex(e => e == "INVOICE") + 1;
        int.TryParse(_contentList[index], out var value);
        Console.WriteLine(value);
    }


    public void CreatePdfContent(string filePath)
    {
        using (var file = new File(filePath))
        {
            var document = file.Document;

            foreach (var page in document.Pages)
            {
                Extract(new ContentScanner(page));
            }
        }
    }

    private void Extract(ContentScanner level)
    {
        if (level == null)
            return;

        while (level.MoveNext())
        {
            var content = level.Current;
            switch (content)
            {
                case ShowText text:
                {
                    var font = level.State.Font;
                    _contentList.Add(font.Decode(text.Text));
                    break;
                }
                case Text _:
                case ContainerObject _:
                    Extract(level.ChildLevel);
                    break;
            }
        }
    }
}

Input extracted from pdf file. The code scan return following elements:

INVOICE
0005

PAYMENT DUE BY:
4/19/2019
.etc
.
.
.
Tax
USD TOTAL
171857
18 september 2019

and here is the result

5

The code is inspired from this link.

like image 72
Maytham Avatar answered Dec 28 '22 09:12

Maytham


Assuming that the invoice label and invoice number is embedded as text in PDF and not as Bitmap.

One way that I can think of doing this is by using Spire.PDF and extract location of the label, and then find the number written right below that location. This will be relatively simple if you have same template of all the PDFs you want to process.

like image 43
Riaz Raza Avatar answered Dec 28 '22 08:12

Riaz Raza