I need to search a PDF file to see whether a certain string is present. The string in question is definitely encoded as text (i.e. it is not an image or anything). I have tried searching the file as though it were plain text, but this does not work.
Is it possible to do this? Are there any libraries for .NET 2.0 that will extract/decode all the text from a PDF file for me?
There are a few libraries available out there. Check out http://www.codeproject.com/KB/cs/PDFToText.aspx and http://itextsharp.sourceforge.net/
It takes a little bit of effort but it's possible.
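For the iTextSharp route, something along these lines might work. This is only a sketch: it assumes iTextSharp 5.x, where the iTextSharp.text.pdf.parser namespace provides PdfTextExtractor (older releases may not have it), and the PdfSearch class and ContainsText method are names made up for illustration.

```csharp
using System;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class PdfSearch
{
    // Hypothetical helper: returns true if 'needle' occurs in the
    // extracted text of any single page (case-insensitive).
    public static bool ContainsText(string path, string needle)
    {
        PdfReader reader = new PdfReader(path);
        try
        {
            // iTextSharp page numbers are 1-based.
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                string pageText = PdfTextExtractor.GetTextFromPage(reader, page);
                if (pageText.IndexOf(needle, StringComparison.OrdinalIgnoreCase) >= 0)
                    return true;
            }
            return false;
        }
        finally
        {
            // Release the file handle even if extraction throws.
            reader.Close();
        }
    }
}
```

Note that this checks each page separately, so a phrase that is split across a page break or a line break in the extracted text would not be found without normalising the text first.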
You can use the Docotic.Pdf library to search for text in PDF files.
Here is some sample code:
// Requires: using System; using BitMiracle.Docotic.Pdf;
static void searchForText(string path, string text)
{
    using (PdfDocument pdf = new PdfDocument(path))
    {
        for (int i = 0; i < pdf.Pages.Count; i++)
        {
            // Extract the plain text of the current page
            string pageText = pdf.Pages[i].GetText();
            int index = pageText.IndexOf(text, 0, StringComparison.CurrentCultureIgnoreCase);
            if (index != -1)
                Console.WriteLine("'{0}' found on page {1}", text, i + 1); // i is zero-based
        }
    }
}
The library can also extract formatted and plain text from the whole document or any document page.
Disclaimer: I work for Bit Miracle, vendor of the library.
In the vast majority of cases, it's not possible to search the contents of a PDF directly by opening it in Notepad. Even in the minority of cases where that works (depending on how the PDF was constructed), you'll only ever be able to search for individual words, because of the way PDF handles text internally.
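To illustrate why a raw search tends to fail even on an uncompressed file, here is a small sketch. It assumes a hand-written content-stream fragment in which the text is shown with a TJ array, so the phrase is broken into pieces with kerning numbers in between. The fragment, the TjDemo class and the JoinLiteralStrings helper are all made up for illustration, and the helper ignores escape sequences, nested parentheses and hex strings. In real files the stream is usually compressed as well, so even this much is not visible.

```csharp
using System;
using System.Text;

public static class TjDemo
{
    // Concatenates the literal strings "(...)" found in a content-stream
    // fragment, skipping everything outside the parentheses (operators,
    // operands and the kerning numbers between the pieces).
    public static string JoinLiteralStrings(string content)
    {
        StringBuilder sb = new StringBuilder();
        bool inside = false;
        foreach (char c in content)
        {
            if (!inside && c == '(')
                inside = true;
            else if (inside && c == ')')
                inside = false;
            else if (inside)
                sb.Append(c);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // Hypothetical fragment: "Hello World" split into three pieces.
        string fragment = "BT /F1 12 Tf [(Hel) -12 (lo W) 30 (orld)] TJ ET";

        // The phrase is not contiguous in the raw bytes.
        Console.WriteLine(fragment.IndexOf("Hello World") >= 0);                     // False

        // After joining the pieces it can be found.
        Console.WriteLine(JoinLiteralStrings(fragment).IndexOf("Hello World") >= 0); // True
    }
}
```

A real extractor also has to handle font encodings, compression and text positioning, which is why a library is the practical choice.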
My company has a commercial solution that will let you extract text from a PDF file. I've included some sample code for you below, as shown on this page, that demonstrates how to search through the text from a PDF file for a particular string.
using System;
using System.IO;
using QuickPDFDLL0718;

namespace QPLConsoleApp
{
    public class QPL
    {
        public static void Main()
        {
            // This example uses the DLL edition of Quick PDF Library.
            // Create an instance of the class and give it the path to the DLL.
            PDFLibrary QP = new PDFLibrary("QuickPDFDLL0718.dll");

            // Check if the DLL was loaded successfully
            if (QP.LibraryLoaded())
            {
                // Insert license key here / Check the license key
                if (QP.UnlockKey("...") == 1)
                {
                    QP.LoadFromFile(@"C:\Program Files\Quick PDF Library\DLL\GettingStarted.pdf");

                    int iPageCount = QP.PageCount();
                    int PageNumber = 1;
                    int MatchesFound = 0;

                    while (PageNumber <= iPageCount)
                    {
                        QP.SelectPage(PageNumber);
                        string PageText = QP.GetPageText(3);

                        // Write the page text to a temp file, then read it back line by line
                        string tempFile = QP.GetTempPath() + "temp" + PageNumber + ".txt";
                        using (StreamWriter TempFile = new StreamWriter(tempFile))
                        {
                            TempFile.Write(PageText);
                        }

                        // Split each line into its comma-separated fields
                        string[] lines = File.ReadAllLines(tempFile);
                        string[][] grid = new string[lines.Length][];
                        for (int i = 0; i < lines.Length; i++)
                        {
                            grid[i] = lines[i].Split(',');
                        }

                        foreach (string[] line in grid)
                        {
                            // Skip lines that do not have a 12th field
                            if (line.Length < 12)
                                continue;

                            // The 12th field (index 11) is treated as the text
                            string FindMatch = line[11];

                            // Update this string to the word that you're searching for.
                            // It can be one or more words (e.g. "sunday" or "last sunday").
                            if (FindMatch.Contains("characters"))
                            {
                                Console.WriteLine("Success! Word match found on page: " + PageNumber);
                                MatchesFound++;
                            }
                        }
                        PageNumber++;
                    }

                    if (MatchesFound == 0)
                    {
                        Console.WriteLine("Sorry! No matches found.");
                    }
                    else
                    {
                        Console.WriteLine();
                        Console.WriteLine("Total: " + MatchesFound + " matches found!");
                    }
                    Console.ReadLine();
                }
            }
        }
    }
}