I see many questions and answers about using C# to generate PDF files.
I have a related, but different task.
I have a large number of existing PDF files, and I would like to validate certain parts of their content with regular expressions (regexes). I want to open the PDFs in C# and read out the text in something approaching a linear fashion.
If headers, footers, any sidebars, etc, get skipped or read out of order, it doesn't matter. I'm just after as much of the main-body text as I can retrieve.
Can you point me towards tools, libraries, APIs, etc., that will enable me to programmatically read text in PDF files?
I used PDFSharp as recently as last autumn and found it very easy to use compared to other libraries. Home page for PDFSharp.