I am parsing a .pdf using the acrobat.tlb library.
Hyphenated words are being split across new lines, with the hyphens removed.
For example, ABC-123-XXX-987
parses as:
ABC
123
XXX
987
If I parse the text using iTextSharp, it returns the whole string as displayed in the file, which is the behaviour I want. However, I need to highlight these strings (serial numbers) in the .pdf, and iTextSharp is not placing the highlight in the correct location, hence acrobat.tlb.
I am using this code, from here: http://www.vbforums.com/showthread.php?561501-RESOLVED-2003-How-to-highlight-text-in-pdf
' filey = "*your full file name including directory here*"
AcroExchApp = CreateObject("AcroExch.App")
AcroExchAVDoc = CreateObject("AcroExch.AVDoc")
' Open the [strfiley] pdf file
AcroExchAVDoc.Open(filey, "")
' Get the PDDoc associated with the open AVDoc
AcroExchPDDoc = AcroExchAVDoc.GetPDDoc
sustext = "accessorizes"
suktext = "accessorises"
' get JavaScript Object
' note jso is related to PDDoc of a PDF,
jso = AcroExchPDDoc.GetJSObject
' count
nCount = 0
nCount1 = 0
gbStop = False
bUSCnt = False
bUKCnt = False
' search for the text
If Not jso Is Nothing Then
    ' total number of pages
    nPages = jso.numpages
    ' Go through pages
    For i = 0 To nPages - 1
        ' check each word in a page
        nWords = jso.getPageNumWords(i)
        For j = 0 To nWords - 1
            ' get a word
            word = Trim(CStr(jso.getPageNthWord(i, j)))
            'If VarType(word) = VariantType.String Then
            If word <> "" Then
                ' compare the word with what the user wants
                If Trim(sustext) <> "" Then
                    result = StrComp(word, sustext, vbTextCompare)
                    ' if same
                    If result = 0 Then
                        nCount = nCount + 1
                        If bUSCnt = False Then
                            iUSCnt = iUSCnt + 1
                            bUSCnt = True
                        End If
                    End If
                End If
                If suktext <> "" Then
                    result1 = StrComp(word, suktext, vbTextCompare)
                    ' if same
                    If result1 = 0 Then
                        nCount1 = nCount1 + 1
                        If bUKCnt = False Then
                            iUKCnt = iUKCnt + 1
                            bUKCnt = True
                        End If
                    End If
                End If
            End If
        Next j
    Next i
    jso = Nothing
End If
The code does the job of highlighting the text, but the For loop that fills the 'word' variable splits the hyphenated string into its component parts:
For i = 0 To nPages - 1
    ' check each word in a page
    nWords = jso.getPageNumWords(i)
    For j = 0 To nWords - 1
        ' get a word
        word = Trim(CStr(jso.getPageNthWord(i, j)))
Does anyone know how to maintain the whole string using acrobat.tlb? My quite extensive searches have drawn a blank.
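One possible workaround, offered only as a sketch, is to keep the word-by-word loop and compare runs of consecutive words against the parts of the serial number. The helper below is hypothetical (`FindHyphenatedRuns` and the sample word list are not part of acrobat.tlb); it assumes the page words have already been collected into a list, for example from `jso.getPageNthWord(i, j)`. The Acrobat JavaScript API also documents an optional third `bStrip` argument on `getPageNthWord`; passing `False` is supposed to keep punctuation, which may be another way to see the hyphens, though that is untested here.

```vbnet
Imports System
Imports System.Collections.Generic

Module HyphenatedWordSearch

    ' Hypothetical helper: returns the index of the first word of every run of
    ' consecutive words that matches the hyphen-separated parts of target
    ' (case-insensitive).
    Public Function FindHyphenatedRuns(words As IList(Of String), target As String) As List(Of Integer)
        Dim hits As New List(Of Integer)
        Dim parts As String() = target.Split("-"c)

        ' If there are fewer words than parts, this loop runs zero times.
        For start As Integer = 0 To words.Count - parts.Length
            Dim isMatch As Boolean = True
            For k As Integer = 0 To parts.Length - 1
                If Not String.Equals(words(start + k).Trim(), parts(k), StringComparison.OrdinalIgnoreCase) Then
                    isMatch = False
                    Exit For
                End If
            Next
            If isMatch Then hits.Add(start)
        Next

        Return hits
    End Function

    Sub Main()
        ' Assumed sample data: the words as the loop in the question returns them.
        Dim pageWords As New List(Of String) From {"Serial", "ABC", "123", "XXX", "987", "shipped"}
        For Each index As Integer In FindHyphenatedRuns(pageWords, "ABC-123-XXX-987")
            Console.WriteLine("Match starts at word " & index)
        Next
    End Sub

End Module
```

Each returned index, together with the number of parts, gives the range of word indexes to highlight on that page. Note that this comparison cannot tell hyphens from spaces, so the text "ABC 123 XXX 987" would match as well.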
I can understand that iTextSharp is troublesome for highlighting text, because you have to draw a rectangle yourself and it gets complicated, but the acrobat.tlb solution has its drawbacks too: it is not free, so few people are likely to use it. A better solution for the rest of us is the free and easy-to-use Spire.Pdf. You can get it from NuGet. The code does the following:
- Opens the .pdf
- Reads the text of each page
- Uses a regular expression to find matches
- Saves them to a list of strings, eliminating duplicates
- For each string in this list, searches the page and highlights the word
Code:
'at the top of the file
Imports System.Drawing
Imports System.Linq
Imports System.Text.RegularExpressions
Imports Spire.Pdf
Imports Spire.Pdf.General.Find

Dim pdf As PdfDocument = New PdfDocument("Path")
'four groups of three uppercase letters or digits, separated by hyphens
'(no comma inside the class: a comma there would be matched literally)
Dim pattern As String = "([A-Z0-9]{3}[-][A-Z0-9]{3}[-][A-Z0-9]{3}[-][A-Z0-9]{3})"
Dim matches As MatchCollection
Dim result As PdfTextFind() = Nothing
Dim matchList As New List(Of String)
For Each page As PdfPageBase In pdf.Pages
    'get text from the current page only
    Dim pageText As String = page.ExtractText()
    'find matches
    matches = Regex.Matches(pageText, pattern, RegexOptions.None)
    matchList.Clear()
    'Assign each match to a string list.
    For Each match As Match In matches
        matchList.Add(match.Value)
    Next
    'Eliminate duplicates.
    matchList = matchList.Distinct.ToList
    'for each string in list
    For i = 0 To matchList.Count - 1
        'find all occurrences of matchList(i) string in page and highlight it
        result = page.FindText(matchList(i)).Finds
        For Each find As PdfTextFind In result
            find.ApplyHighLight(Color.BlueViolet) 'you can set your color preference
        Next
    Next 'matchList
Next 'page
pdf.SaveToFile("New Path")
pdf.Close()
pdf.Dispose()
I am not very good with regular expressions, so feel free to substitute your own. That was my approach, anyway.
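As one possible tightening of the pattern, here is a small sketch. Inside a character class a comma is a literal character, so a class written as `[A-Z,0-9]` also accepts commas; `[A-Z0-9]` does not. The constant name `SerialPattern` and the sample sentence are made up for illustration, and the four-groups-of-three layout is assumed from the example serial number in the question.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module SerialPatternDemo

    ' Four groups of three uppercase letters or digits, separated by hyphens.
    ' \b keeps a match from starting or ending inside a longer token.
    Public Const SerialPattern As String = "\b[A-Z0-9]{3}(?:-[A-Z0-9]{3}){3}\b"

    Sub Main()
        ' Assumed sample text: one valid serial and one that contains a comma.
        Dim sample As String = "Units ABC-123-XXX-987 and A,C-123-XXX-987 were returned."
        For Each m As Match In Regex.Matches(sample, SerialPattern)
            Console.WriteLine(m.Value) ' prints only ABC-123-XXX-987
        Next
    End Sub

End Module
```

If the serial numbers in your files can have a different number of groups or group lengths, adjust the `{3}` quantifiers accordingly.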