Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Extract complete hyphenated word from .pdf using acrobat.tlb in .NET VB or C#

I am parsing a .pdf using the acrobat.tlb library

Hyphenated words are being split across new lines with the hyphens removed.

e.g. ABC-123-XXX-987

Parses as:
ABC
123
XXX
987

If I parse the text using iTextSharp it parses the whole string as displayed in the file which is the behaviour I want. However, I need to highlight these strings (serial numbers) in the .pdf and iTextSharp is not placing the highlight in the correct location... hence acrobat.tlb

I am using this code, from here: http://www.vbforums.com/showthread.php?561501-RESOLVED-2003-How-to-highlight-text-in-pdf

 ' filey = "*your full file name including directory here*"
        AcroExchApp = CreateObject("AcroExch.App")
        AcroExchAVDoc = CreateObject("AcroExch.AVDoc")
        ' Open the [strfiley] pdf file
        AcroExchAVDoc.Open(filey, "")       

        ' Get the PDDoc associated with the open AVDoc
        AcroExchPDDoc = AcroExchAVDoc.GetPDDoc
        sustext = "accessorizes"
        suktext = "accessorises" 
        ' get JavaScript Object
        ' note jso is related to PDDoc of a PDF,
        jso = AcroExchPDDoc.GetJSObject
        ' count
        nCount = 0
        nCount1 = 0
        gbStop = False
        bUSCnt = False
        bUKCnt = False
        ' search for the text
        If Not jso Is Nothing Then
            ' total number of pages
            nPages = jso.numpages           

                ' Go through pages
                For i = 0 To nPages - 1
                    ' check each word in a page
                    nWords = jso.getPageNumWords(i)
                    For j = 0 To nWords - 1
                        ' get a word

                        word = Trim(CStr(jso.getPageNthWord(i, j)))
                        'If VarType(word) = VariantType.String Then
                        If word <> "" Then
                            ' compare the word with what the user wants
                            If Trim(sustext) <> "" Then
                                result = StrComp(word, sustext, vbTextCompare)
                                ' if same
                                If result = 0 Then
                                    nCount = nCount + 1
                                    If bUSCnt = False Then
                                        iUSCnt = iUSCnt + 1
                                        bUSCnt = True
                                    End If
                                End If
                            End If
                            If suktext<> "" Then
                                result1 = StrComp(word, suktext, vbTextCompare)
                                ' if same
                                If result1 = 0 Then
                                    nCount1 = nCount1 + 1
                                    If bUKCnt = False Then
                                        iUKCnt = iUKCnt + 1
                                        bUKCnt = True
                                    End If
                                End If
                            End If
                        End If
                    Next j
                Next i
jso = Nothing
        End If

The code does the job of highlighting the text, but the FOR loop with the 'word' variable is splitting the hyphenated string into component parts.

For i = 0 To nPages - 1
                        ' check each word in a page
                        nWords = jso.getPageNumWords(i)
                        For j = 0 To nWords - 1
                            ' get a word

                            word = Trim(CStr(jso.getPageNthWord(i, j)))

Does anyone know how to maintain the whole string using acrobat.tlb? My quite extensive searches have drawn a blank.

like image 476
GoodJuJu Avatar asked Sep 12 '18 09:09

GoodJuJu


People also ask

How to extract text from a PDF file?

Extracting text from a PDF document is a common task for C# and VB.NET developers. You can use Docotic.Pdf library to extract text in just a few lines of code on Windows, Linux, macOS, Android, iOS, or in a cloud environment.

How to read text from any page of a PDF file?

Following code snippet follows these steps to read text from any page of a PDF file using C#: Page page = pdfDocument. Pages [ 1 ]; page. Accept ( textAbsorber ); string extractedText = textAbsorber. Text; tw. WriteLine ( extractedText ); tw. Close (); Let us take this text extraction another step further.

How to optimize memory consumption while extracting text from PDF documents?

The following are two different approaches to optimize memory consumption while extracting text from PDF documents using C# language. Sometimes the text extraction may consume huge memory and processor. Possibly when the input file is huge and contains a lot of text. Because TextFragmentAbsorber object stores all found text fragments in the memory.

How to convert PDF to plain text in C #?

Convert PDF to plain text. You may use plain text for indexing, reading, or some kind of analysis of PDF content. This sample shows how to convert PDF to text in C#: using BitMiracle.Docotic.Pdf; using (var pdf = new PdfDocument("your_document.pdf")) { string documentText = pdf.GetText(); Console.WriteLine(documentText); }.


1 Answers

I can understand that iTextSharp is troublesome when highlighting text cause you have to draw a rectangle and becomes complicated but the solution of acrobat.tlb has its drawback also. It is not free, few people might use it. A better solution for the rest of us is the free and easy to use Spire.Pdf. You can get it from NuGet packages. The code does the folowings:

  • Opens .pdf
  • Read each text page
  • using regular expression find matches
  • save them to a list of strings eliminating duplicates
  • for each string in this list search page and highlight the word

Code:

Dim pdf As PdfDocument = New PdfDocument("Path")
Dim pattern As String = "([A-Z,0-9]{3}[-][A-Z,0-9]{3}[-][A-Z,0-9]{3}[-][A-Z,0-9]{3})"
Dim matches As MatchCollection

Dim result As PdfTextFind() = Nothing
Dim content As New StringBuilder()
Dim matchList As New List(Of String)

For Each page As PdfPageBase In pdf.Pages
    'get text from current page
    content.Append(page.ExtractText())

    'find matches
    matches = Regex.Matches(content.ToString, pattern, RegexOptions.None)

    matchList.Clear()

    'Assign each match to a string list.
    For Each match As Match In matches
        matchList.Add(match.Value)
    Next

    'Eliminate duplicates.
    matchList = matchList.Distinct.ToList

    'for each string in list
    For i = 0 To matchList.Count - 1
        'find all occurances of matchList(i) string in page and highlight it
        result = page.FindText(matchList(i)).Finds

        For Each find As PdfTextFind In result
            find.ApplyHighLight(Color.BlueViolet) 'you can set your color preference
        Next

    Next 'matchList

Next 'page

pdf.SaveToFile("New Path")

pdf.Close()
pdf.Dispose()

I am not so good in regular expression so you can implement yours. That was my approach anyway.