Extract complete hyphenated word from .pdf using acrobat.tlb in .NET VB or C#

I am parsing a .pdf using the acrobat.tlb library.

Hyphenated words are being split across new lines with the hyphens removed.

e.g. ABC-123-XXX-987

Parses as:
ABC
123
XXX
987

If I parse the text using iTextSharp, it returns the whole string as displayed in the file, which is the behaviour I want. However, I need to highlight these strings (serial numbers) in the .pdf, and iTextSharp does not place the highlight in the correct location, hence acrobat.tlb.

I am using this code, taken from here: http://www.vbforums.com/showthread.php?561501-RESOLVED-2003-How-to-highlight-text-in-pdf

    ' filey = "*your full file name including directory here*"
    ' The Acrobat objects are late-bound COM objects, so this needs Option Strict Off
    Dim AcroExchApp As Object = CreateObject("AcroExch.App")
    Dim AcroExchAVDoc As Object = CreateObject("AcroExch.AVDoc")
    ' Open the [filey] pdf file
    AcroExchAVDoc.Open(filey, "")

    ' Get the PDDoc associated with the open AVDoc
    Dim AcroExchPDDoc As Object = AcroExchAVDoc.GetPDDoc
    Dim sustext As String = "accessorizes"
    Dim suktext As String = "accessorises"
    ' Get the JavaScript object
    ' note jso is related to the PDDoc of a PDF
    Dim jso As Object = AcroExchPDDoc.GetJSObject
    ' Match counts per spelling
    Dim nCount As Integer = 0
    Dim nCount1 As Integer = 0
    ' Number of documents containing each spelling
    Dim iUSCnt As Integer = 0
    Dim iUKCnt As Integer = 0
    Dim gbStop As Boolean = False
    Dim bUSCnt As Boolean = False
    Dim bUKCnt As Boolean = False
    ' Search for the text
    If Not jso Is Nothing Then
        ' Total number of pages
        Dim nPages As Integer = jso.numPages

        ' Go through pages
        For i As Integer = 0 To nPages - 1
            ' Check each word in a page
            Dim nWords As Integer = jso.getPageNumWords(i)
            For j As Integer = 0 To nWords - 1
                ' Get a word
                Dim word As String = Trim(CStr(jso.getPageNthWord(i, j)))
                If word <> "" Then
                    ' Compare the word with what the user wants
                    If Trim(sustext) <> "" Then
                        Dim result As Integer = StrComp(word, sustext, vbTextCompare)
                        ' If same
                        If result = 0 Then
                            nCount = nCount + 1
                            If bUSCnt = False Then
                                iUSCnt = iUSCnt + 1
                                bUSCnt = True
                            End If
                        End If
                    End If
                    If Trim(suktext) <> "" Then
                        Dim result1 As Integer = StrComp(word, suktext, vbTextCompare)
                        ' If same
                        If result1 = 0 Then
                            nCount1 = nCount1 + 1
                            If bUKCnt = False Then
                                iUKCnt = iUKCnt + 1
                                bUKCnt = True
                            End If
                        End If
                    End If
                End If
            Next j
        Next i
        jso = Nothing
    End If

The code does the job of highlighting the text, but the For loop that fills the 'word' variable splits the hyphenated string into its component parts:

    For i = 0 To nPages - 1
        ' check each word in a page
        nWords = jso.getPageNumWords(i)
        For j = 0 To nWords - 1
            ' get a word
            word = Trim(CStr(jso.getPageNthWord(i, j)))

Does anyone know how to maintain the whole string using acrobat.tlb? My quite extensive searches have drawn a blank.

asked Sep 12 '18 09:09

GoodJuJu

1 Answer

I can understand that highlighting text with iTextSharp is troublesome, because you have to draw a rectangle yourself and it gets complicated, but acrobat.tlb has its drawbacks too: it is not free, so few people can use it. A better solution for the rest of us is Spire.Pdf, which is free and easy to use. You can get it from NuGet. The code does the following:

Open the .pdf

Read the text of each page

Use a regular expression to find matches

Save them to a list of strings, eliminating duplicates

For each string in this list, search the page and highlight the word
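The matching and de-duplication steps above can be tried on their own, without any PDF library. The following is a minimal sketch, assuming the serial numbers are always four groups of three upper-case letters or digits joined by hyphens; the module name `SerialNumberMatcher`, the function `FindSerialNumbers` and the sample text are hypothetical and only serve the illustration.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports System.Text.RegularExpressions

Module SerialNumberMatcher
    ' Assumed format: XXX-XXX-XXX-XXX, each X an upper-case letter or a digit.
    ' The lookbehind and lookahead reject matches that are part of a longer
    ' run of letters, digits or hyphens (for example a five-group code).
    Private ReadOnly SerialPattern As New Regex(
        "(?<![A-Za-z0-9-])[A-Z0-9]{3}(?:-[A-Z0-9]{3}){3}(?![A-Za-z0-9-])")

    ' Returns every distinct serial number found in the given text.
    Public Function FindSerialNumbers(text As String) As List(Of String)
        Dim found As IEnumerable(Of String) =
            SerialPattern.Matches(text).Cast(Of Match)().Select(Function(m) m.Value)
        Return found.Distinct().ToList()
    End Function

    Sub Main()
        ' Hypothetical page text containing one repeated serial number.
        Dim sample As String =
            "Units ABC-123-XXX-987 and DEF-456-YYY-654; again ABC-123-XXX-987."
        For Each serial As String In FindSerialNumbers(sample)
            Console.WriteLine(serial)
        Next
    End Sub
End Module
```

Unlike a character class such as `[A-Z,0-9]`, the class `[A-Z0-9]` does not accept commas, which is probably what is wanted for serial numbers.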

Code:

Imports System.Collections.Generic
Imports System.Drawing
Imports System.Linq
Imports System.Text.RegularExpressions
Imports Spire.Pdf
Imports Spire.Pdf.General.Find

Dim pdf As PdfDocument = New PdfDocument("Path")
'four groups of three upper-case letters or digits, separated by hyphens
Dim pattern As String = "[A-Z0-9]{3}-[A-Z0-9]{3}-[A-Z0-9]{3}-[A-Z0-9]{3}"
Dim matches As MatchCollection

Dim result As PdfTextFind() = Nothing
Dim matchList As New List(Of String)

For Each page As PdfPageBase In pdf.Pages
    'get text from current page only
    Dim content As String = page.ExtractText()

    'find matches
    matches = Regex.Matches(content, pattern, RegexOptions.None)

    matchList.Clear()

    'Assign each match to a string list.
    For Each match As Match In matches
        matchList.Add(match.Value)
    Next

    'Eliminate duplicates.
    matchList = matchList.Distinct.ToList

    'for each string in list
    For i As Integer = 0 To matchList.Count - 1
        'find all occurrences of matchList(i) string in page and highlight it
        result = page.FindText(matchList(i)).Finds

        For Each find As PdfTextFind In result
            find.ApplyHighLight(Color.BlueViolet) 'you can set your color preference
        Next

    Next 'matchList

Next 'page

pdf.SaveToFile("New Path")

pdf.Close()
pdf.Dispose()

I am not very good at regular expressions, so you may want to write your own pattern. That was my approach anyway.
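If you need to stay with acrobat.tlb, one avenue that may be worth testing is the optional third argument of `getPageNthWord`, documented in the Acrobat JavaScript API as `bStrip`: passing `False` asks Acrobat not to strip punctuation and white space from the returned word. The sketch below assumes, without having verified it against a real file, that the unstripped pieces then come back as "ABC-", "123-", "XXX-", "987 ", and it glues pieces that end in a hyphen onto the word that follows. The names `HyphenJoiner`, `JoinedWord` and `JoinHyphenatedWords` are hypothetical; the function works on a plain list of strings, so it can be tried without Acrobat installed.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text

Module HyphenJoiner
    ' One rejoined token plus the range of original word indexes it covers.
    Public Structure JoinedWord
        Public Text As String
        Public FirstIndex As Integer
        Public LastIndex As Integer
    End Structure

    ' rawWords: the words of one page, in order, as returned WITHOUT stripping
    ' (assumed to be collected via jso.getPageNthWord(page, j, False)).
    Public Function JoinHyphenatedWords(rawWords As IList(Of String)) As List(Of JoinedWord)
        Dim joined As New List(Of JoinedWord)
        Dim buffer As New StringBuilder()
        Dim first As Integer = 0

        For j As Integer = 0 To rawWords.Count - 1
            Dim piece As String = rawWords(j).Trim()
            If buffer.Length = 0 Then first = j
            buffer.Append(piece)

            ' A trailing hyphen means the next word belongs to the same token,
            ' unless this is the last word on the page.
            Dim continues As Boolean =
                piece.EndsWith("-", StringComparison.Ordinal) AndAlso j < rawWords.Count - 1

            If Not continues Then
                If buffer.Length > 0 Then
                    joined.Add(New JoinedWord With {
                        .Text = buffer.ToString(), .FirstIndex = first, .LastIndex = j})
                End If
                buffer.Clear()
            End If
        Next

        Return joined
    End Function

    Sub Main()
        ' Hypothetical unstripped words for one page.
        Dim raw As String() = {"Serial ", "ABC-", "123-", "XXX-", "987 ", "shipped."}
        For Each w As JoinedWord In JoinHyphenatedWords(raw)
            Console.WriteLine(w.Text & " [" & w.FirstIndex & ".." & w.LastIndex & "]")
        Next
    End Sub
End Module
```

Because the words are unstripped, other trailing punctuation (a comma or full stop after the serial number) stays attached, so the joined text is best checked with a regular expression rather than compared for equality. The `FirstIndex` and `LastIndex` values are kept so that the original word positions remain available, for example to pass each index to `getPageNthWordQuads` when placing the highlight.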

answered Oct 12 '22 20:10

γηράσκω δ' αεί πολλά διδασκόμε


Tags: .net, parsing, vb.net, acrobat
