I have an ASP.NET Core 2.0 C# application that reads and parses a PDF file to get its text. From that text I want to read a specific value identified by its label name. As the image below shows, I want to get the value 171857, which is the invoice number, and store it in a database.
I tried the code below to read the PDF using iTextSharp.
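// Requires: using System.Text; using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
using (PdfReader reader = new PdfReader(fileName))
{
    StringBuilder sb = new StringBuilder();
    // iTextSharp page numbers are 1-based.
    for (int page = 1; page <= reader.NumberOfPages; page++)
    {
        // Use a fresh strategy per page: a reused SimpleTextExtractionStrategy
        // keeps the text of earlier pages and would return it again.
        ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
        string text = PdfTextExtractor.GetTextFromPage(reader, page, strategy);
        if (!string.IsNullOrWhiteSpace(text))
        {
            sb.Append(Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(text))));
        }
    }
    var pdfText = sb.ToString();
}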
In the pdfText variable I get all the text content of the PDF, but this does not seem to be the proper way to get the invoice number. Is there another way to read specific content from a PDF by its label name, so that, for example, we provide the label name Invoice and it returns the value 171857, possibly with another third-party PDF reader library?
Any help or suggestions would be highly appreciated.
Thanks
I once helped a friend extract a similar value from a PDF invoice generated from Excel. For this answer I created an Excel invoice, printed it as a PDF file, and zipped both for download so you can test with them.
Next, I use a free, open-source library called PDF Clown, which is available as a NuGet package.
What I do is scan the whole PDF document (an invoice can be one page or several) and add each piece of text content to a list of strings.
The next step is to find the index of the element that identifies the invoice value, which I will call the tag or label. In general the invoice number could sit anywhere in the list, say at the 10th element; in our case it is at index 1.
Since I do not have your PDF file, I improvised and added a unique tag called "INVOICE" (any other name would do). The invoice number in this case comes right after the invoice tag, so I find the index of the "INVOICE" tag and add 1 to it. This way I pick up the invoice text 0005 and return it as the value 5. In the same way you can fetch whatever text or value follows any tag in the scanned list and return it in the form you need.
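The lookup described above does not depend on any PDF library, so it can be sketched on its own. The following is only a minimal illustration: the LabelLookup class and the ValueAfter method are hypothetical names, and the sample list simply repeats the first elements produced by the test invoice. It returns null instead of reading a wrong element when the tag is missing.

```csharp
using System;
using System.Collections.Generic;

public static class LabelLookup
{
    // Hypothetical helper: returns the element that follows the first
    // occurrence of `label`, or null when the label is missing or is the
    // last element of the list.
    public static string ValueAfter(IReadOnlyList<string> items, string label)
    {
        for (var i = 0; i < items.Count - 1; i++)
        {
            if (string.Equals(items[i].Trim(), label, StringComparison.OrdinalIgnoreCase))
            {
                return items[i + 1].Trim();
            }
        }
        return null;
    }
}

public static class LabelLookupDemo
{
    public static void Main()
    {
        // Sample data: the first elements scanned from the test invoice.
        var scanned = new List<string> { "INVOICE", "0005", "PAYMENT DUE BY:", "4/19/2019" };

        var raw = LabelLookup.ValueAfter(scanned, "INVOICE");
        if (raw != null && int.TryParse(raw, out var number))
        {
            Console.WriteLine(number); // prints 5
        }
        else
        {
            Console.WriteLine("Invoice number not found.");
        }
    }
}
```

Returning null for a missing tag keeps the caller in control of what "not found" means, which matters when the value is about to be written to a database.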
You will need to adjust it a bit to fit your PDF file exactly.
Here are my test files, the Excel and PDF versions, zipped. Download them for your own testing.
Here is the code:
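// Namespaces as in the original PDF Clown .NET distribution;
// a NuGet port may name them differently.
using System;
using System.Collections.Generic;
using org.pdfclown.documents.contents;
using org.pdfclown.documents.contents.objects;
using File = org.pdfclown.files.File; // avoids a clash with System.IO.File

public class InvoiceTextExtraction
{
    private List<string> _contentList;

    public void GetValueFromPdf()
    {
        _contentList = new List<string>();
        CreatePdfContent(@"C:\temp\Invoice1.pdf");

        // FindIndex returns -1 when the tag is missing, so check before
        // reading the element that follows it.
        var labelIndex = _contentList.FindIndex(e => e == "INVOICE");
        if (labelIndex < 0 || labelIndex + 1 >= _contentList.Count)
        {
            Console.WriteLine("INVOICE tag or its value was not found.");
            return;
        }

        // "0005" parses to 5.
        int.TryParse(_contentList[labelIndex + 1], out var value);
        Console.WriteLine(value);
    }

    public void CreatePdfContent(string filePath)
    {
        using (var file = new File(filePath))
        {
            var document = file.Document;
            // Scan every page; an invoice can span several pages.
            foreach (var page in document.Pages)
            {
                Extract(new ContentScanner(page));
            }
        }
    }

    // Walks the content stream recursively and collects every shown text.
    private void Extract(ContentScanner level)
    {
        if (level == null)
            return;

        while (level.MoveNext())
        {
            var content = level.Current;
            switch (content)
            {
                case ShowText text:
                {
                    // Decode the raw bytes with the font currently in effect.
                    var font = level.State.Font;
                    _contentList.Add(font.Decode(text.Text));
                    break;
                }
                case Text _:
                case ContainerObject _:
                    // Descend into nested content.
                    Extract(level.ChildLevel);
                    break;
            }
        }
    }
}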
This is the content extracted from the PDF file. The scan returns the following elements:
INVOICE
0005
PAYMENT DUE BY:
4/19/2019
...
Tax
USD TOTAL
171857
18 september 2019
And here is the result:
5
The code is inspired by this link.
This assumes that the invoice label and the invoice number are embedded as text in the PDF and not as a bitmap.
One way I can think of doing this is to use Spire.PDF to extract the location of the label and then find the number written right below that location. This is relatively simple if all the PDFs you want to process share the same template.
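The "find what sits right below the label" step can be sketched without committing to a particular library. The code below is not the Spire.PDF API: TextChunk, PositionLookup and ValueBelow are hypothetical names, the coordinates are made up for illustration, and it assumes you have already obtained each text fragment together with the top-left corner of its bounding box, in a coordinate system where Y grows downwards.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape: one piece of text plus the top-left corner of its
// bounding box (assumption: Y grows downwards on the page).
public sealed class TextChunk
{
    public TextChunk(string text, float x, float y)
    {
        Text = text;
        X = x;
        Y = y;
    }

    public string Text { get; }
    public float X { get; }
    public float Y { get; }
}

public static class PositionLookup
{
    // Returns the text of the nearest chunk below the label whose left edge
    // lies within `xTolerance` of the label's left edge, or null if none.
    public static string ValueBelow(IEnumerable<TextChunk> chunks, string label, float xTolerance = 20f)
    {
        var list = chunks.ToList();
        var anchor = list.FirstOrDefault(
            c => string.Equals(c.Text.Trim(), label, StringComparison.OrdinalIgnoreCase));
        if (anchor == null)
        {
            return null;
        }

        return list
            .Where(c => c.Y > anchor.Y && Math.Abs(c.X - anchor.X) <= xTolerance)
            .OrderBy(c => c.Y)
            .Select(c => c.Text.Trim())
            .FirstOrDefault();
    }
}

public static class PositionLookupDemo
{
    public static void Main()
    {
        // Made-up positions for illustration only.
        var chunks = new List<TextChunk>
        {
            new TextChunk("Date", 300f, 50f),
            new TextChunk("Invoice", 400f, 50f),
            new TextChunk("18 september 2019", 300f, 65f),
            new TextChunk("171857", 402f, 65f),
        };

        Console.WriteLine(PositionLookup.ValueBelow(chunks, "Invoice")); // prints 171857
    }
}
```

The tolerance on X is what makes this depend on a stable template: if the layout shifts between invoices, the column alignment no longer identifies the right value.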