Extract Data from PDF and Add to Worksheet

Tags: excel, pdf, vba

I am trying to extract the data from a PDF document into a worksheet. The PDFs show and text can be manually copied and pasted into the Excel document.

I am currently doing this through SendKeys and it is not working. I get an error when I try to paste the data from the PDF document. Why is my paste not working? If I paste after the macro has stopped running it pastes as normal.

Dim myPath As String, myExt As String
Dim ws As Worksheet
Dim openPDF As Object
'Dim pasteData As MSForms.DataObject
Dim fCell As Range

'Set pasteData = New MSForms.DataObject
Set ws = Sheets("DATA")
If ws.Cells(ws.Rows.Count, "A").End(xlUp).Row > 1 Then Range("A3:A" & ws.Cells(ws.Rows.Count, "A").End(xlUp).Row).ClearContents

myExt = "\*.pdf"
'When Scan Receipts Button Pressed Scan the selected folder/s for receipts
For Each fCell In Range(ws.Cells(1, 1), ws.Cells(1, ws.Cells(1, ws.Columns.Count).End(xlToLeft).Column))
    myPath = Dir(fCell.Value & myExt)
    Do While myPath <> ""
        myPath = fCell.Value & "\" & myPath
        Set openPDF = CreateObject("Shell.Application")
        openPDF.Open (myPath)
        Application.Wait Now + TimeValue("00:00:2")
        SendKeys "^a"
        Application.Wait Now + TimeValue("00:00:2")
        SendKeys "^c"
        'Application.Wait Now + TimeValue("00:00:2")
        ws.Select
        ActiveSheet.Paste
        'pasteData.GetFromClipboard

        'ws.Cells(3, 1) = pasteData.GetText
        Exit Sub

        myPath = Dir
    Loop

Next fCell


asked Mar 28 '16 20:03

Will Bell

2 Answers

You can open the PDF file and extract its contents using the Adobe Acrobat type library. I believe you can download it from Adobe as part of the SDK, and it also comes with certain versions of Acrobat.

Make sure to add the library to your references too (Tools > References in the VBA editor). On my machine it is the Adobe Acrobat 10.0 Type Library, but I am not sure whether that is the newest version.

Even with the Adobe library it is not trivial (you'll need to add your own error trapping, etc.):
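Before relying on the library, it may be worth checking that Acrobat's automation objects are actually registered on the machine. The following is a minimal sketch, assuming the full Acrobat product (not just the free Reader) is installed; the function name AcrobatIsAvailable is hypothetical, and it uses late binding so it compiles even when the reference has not been added:

```vba
' Hypothetical helper: returns True when the Acrobat automation
' object "AcroExch.PDDoc" can be created on this machine.
Function AcrobatIsAvailable() As Boolean
    Dim probe As Object

    On Error Resume Next                      ' CreateObject raises an error if the ProgID is not registered
    Set probe = CreateObject("AcroExch.PDDoc")
    On Error GoTo 0

    AcrobatIsAvailable = Not probe Is Nothing
    Set probe = Nothing
End Function
```

If this returns False, adding the reference will not help, and a different approach is needed.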

Function getTextFromPDF(ByVal strFilename As String) As String
   Dim objAVDoc As New AcroAVDoc
   Dim objPDDoc As New AcroPDDoc
   Dim objPage As AcroPDPage
   Dim objSelection As AcroPDTextSelect
   Dim objHighlight As AcroHiliteList
   Dim pageNum As Long
   Dim tCount As Long
   Dim strText As String

   strText = ""
   If objAVDoc.Open(strFilename, "") Then
      Set objPDDoc = objAVDoc.GetPDDoc
      For pageNum = 0 To objPDDoc.GetNumPages() - 1
         Set objPage = objPDDoc.AcquirePage(pageNum)
         Set objHighlight = New AcroHiliteList
         objHighlight.Add 0, 10000 ' Adjust this up if it's not getting all the text on the page
         Set objSelection = objPage.CreatePageHilite(objHighlight)

         If Not objSelection Is Nothing Then
            For tCount = 0 To objSelection.GetNumText - 1
               strText = strText & objSelection.GetText(tCount)
            Next tCount
         End If
      Next pageNum
      objAVDoc.Close 1
   End If

   getTextFromPDF = strText

End Function

What this does is essentially the same thing you are trying to do, only using Adobe's own library. It goes through the PDF one page at a time, highlights all of the text on the page, then appends it (one text element at a time) to a string.

Keep in mind that what you get from this could be full of all kinds of non-printing characters (carriage returns, line feeds, etc.) that could even end up in the middle of what look like contiguous blocks of text, so you may need additional code to clean it up before you can use it.
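As a rough illustration of how the function could replace the SendKeys loop from the question, here is a sketch that writes the text of each PDF in one folder into column A of the DATA sheet. The Sub name ImportPdfText is hypothetical, and reading the folder path from cell A1 and starting output at row 3 are assumptions based on the question's code:

```vba
' Hypothetical driver: one row per PDF, starting at row 3 of the "DATA" sheet.
Sub ImportPdfText()
    Dim ws As Worksheet
    Dim folderPath As String
    Dim fileName As String
    Dim nextRow As Long

    Set ws = ThisWorkbook.Worksheets("DATA")

    ' Assumption: the folder to scan is stored in cell A1.
    folderPath = ws.Cells(1, 1).Value
    If Right$(folderPath, 1) <> "\" Then folderPath = folderPath & "\"

    nextRow = 3
    fileName = Dir(folderPath & "*.pdf")
    Do While fileName <> ""
        ' A cell holds at most 32,767 characters, so longer text is cut off here.
        ws.Cells(nextRow, 1).Value = Left$(getTextFromPDF(folderPath & fileName), 32767)
        nextRow = nextRow + 1
        fileName = Dir                        ' next matching file
    Loop
End Sub
```

Because nothing goes through the clipboard or the active window, this avoids the timing problems that SendKeys tends to have.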
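One possible cleanup pass is sketched below. The function name CleanPdfText is hypothetical, and the choice to turn tabs, line breaks, and non-breaking spaces into single spaces (and to drop other control characters) is an assumption that may not suit every document:

```vba
' Hypothetical helper: normalizes whitespace and strips control characters.
Function CleanPdfText(ByVal rawText As String) As String
    Dim i As Long
    Dim ch As String
    Dim result As String

    For i = 1 To Len(rawText)
        ch = Mid$(rawText, i, 1)
        Select Case AscW(ch)
            Case 9, 10, 13, 160               ' tab, LF, CR, non-breaking space
                result = result & " "
            Case 0 To 31, 127                 ' remaining control characters are dropped
            Case Else
                result = result & ch
        End Select
    Next i

    ' Collapse runs of spaces into a single space.
    Do While InStr(result, "  ") > 0
        result = Replace(result, "  ", " ")
    Loop

    CleanPdfText = Trim$(result)
End Function
```

For example, CleanPdfText("Total:" & vbCrLf & "  42" & Chr(0)) returns "Total: 42".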

Hope that helps!

answered Oct 03 '22 07:10

leowyn

I know this is an old issue, but I just had to do this for a project at work, and I am very surprised that nobody has thought of this solution yet: just open the .pdf with Microsoft Word.

The code is a lot easier to work with when you are trying to extract data from a .docx because it opens in Microsoft Word. Excel and Word play well together because they are both Microsoft programs. In my case, the file in question had to be a .pdf file. Here's the solution I came up with:

1. Set Microsoft Word as the default program for opening .pdf files.
2. The first time you open a .pdf file with Word, a dialog box pops up saying that Word will need to convert the .pdf into a .docx file. Tick the check box in the bottom left labeled "do not show this message again" and then click OK.
3. Create a macro that extracts data from a .docx file. I used MikeD's Code as a resource for this.
4. Tinker around with the MoveDown, MoveRight, and Find.Execute methods to fit the needs of your task.

Yes, you could just convert the .pdf file to a .docx file, but this is a much simpler solution in my opinion.
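The Word approach can be driven from Excel roughly as follows. This is a sketch under assumptions, not the answerer's actual macro: the function name GetTextViaWord is hypothetical, it relies on a Word version that can open PDFs (Word 2013 or later, as far as I know), and setting DisplayAlerts to 0 may or may not suppress the conversion prompt on a given setup:

```vba
' Hypothetical helper: opens a PDF in a hidden Word instance and returns its text.
' Uses late binding, so no reference to the Word object library is required.
Function GetTextViaWord(ByVal pdfPath As String) As String
    Dim wordApp As Object
    Dim doc As Object

    Set wordApp = CreateObject("Word.Application")
    wordApp.Visible = False
    wordApp.DisplayAlerts = 0                 ' 0 = wdAlertsNone

    On Error GoTo CleanUp                     ' on failure, fall through and return ""
    Set doc = wordApp.Documents.Open(FileName:=pdfPath, _
                                     ConfirmConversions:=False, _
                                     ReadOnly:=True)
    GetTextViaWord = doc.Content.Text

CleanUp:
    If Not doc Is Nothing Then doc.Close SaveChanges:=False
    wordApp.Quit
End Function
```

Reading doc.Content.Text returns the whole document as one string; the Selection-based MoveDown, MoveRight, and Find.Execute calls mentioned in the steps are the alternative when only specific fields are needed.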

answered Oct 03 '22 05:10

expodavid
