Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

VBA Code Scraper not placing data in right columns

Tags:

excel

vba

The code works fine, but I need it to extract ONLY emails and URLs and place the email in Sheet1 "Scraper" NEXT BLANK ROW

Emails =  Column A
Urls =  Column B

Currently it extracts anything text, emails or URL and places them in column A or B.

I only need Emails or URLs. I have been stuck on this for sometime and can't seem to work it out

Also I am not sure if my DELETE DUPLICATES is deleting duplicate rows or duplicates in column. It SHOULD be duplicate rows.

How the code works:

On Sheet2 "URL List" I have a list of URLs, the code runs through this and places the results onto Sheet1 "Scraper". and deletes any duplicates

It is only supposed to scraper email and URLs and place them in Column A,B on NEXT BLANK ROW.

I have tried to fix the problem but it is out of my scope.

Private Sub fbStart_Click()
'Set sheet2 URL List and open Internet Explorer
    Dim lr          As Long
    Dim x           As Long
    Dim arr()       As Variant
    Dim wks         As Worksheet
    Dim ie          As Object
    Dim dd(1 To 2)  As String
    Dim Fr          As Long
    
    On Error Resume Next
    Application.ScreenUpdating = False
        
    Set wks = ThisWorkbook.Sheets("Url List")
    With wks
        Fr = .Cells(.Rows.Count, 6).End(xlUp).Offset(1).Row
        lr = .Cells(.Rows.Count, 1).End(xlUp).Row
        .Cells(1, 5).Value = lr
        arr = .Range(.Cells(Fr, 1), .Cells(lr, 1)).Value
    End With
        
    'Show Internet Explorer and add delay in seconds if needed
    Set ie = CreateObject("InternetExplorer.Application")
    With ie
        .Visible = True
        Application.Wait Now + TimeValue("0:00:0")
        
        For x = LBound(arr, 1) To UBound(arr, 1)
            .navigate arr(x, 1)
            wtime = Time
            Do While .Busy Or .readyState <> 4
                DoEvents
                
            'Skip pages with Captchas  + write the word Captcha in Sheet 2 Column C
                If Time > (wtime + TimeValue("00:00:10")) Then
                    Cells(x + 1, "C").Value = "Captcha"
                    Exit Do
                End If
            Loop
            
            On Error Resume Next
            'Variable for document or data which need to be extracted out of webpage, change innertext number if same class used
            Dim doc As HTMLDocument
            Set doc = ie.document
            dd(1) = doc.getElementsByClassName("_50f4")(2).innerText
            dd(2) = doc.getElementsByClassName("_50f4")(3).innerText
            
           
            'Paste the web data into Sheet1 "Scraper" in next BLANK ROW
            With Sheet1
                .Cells(.Rows.Count, "A").End(xlUp).Offset(1, 0).Resize(, 2).Value = dd
            End With
            
            ' Put A number 1 in Sheet2 "Url List"column B to notify this URL is done
            Sheets("Url List").Range("B" & Rows.Count).End(xlUp).Offset(1, 0).Value = 1
            
             'Deletes duplicates in column A Sheet1
            Columns(1).RemoveDuplicates Columns:=Array(1)
            Columns(2).RemoveDuplicates Columns:=Array(1)
            
             'Count No1 in sheet2 Column B
            With Worksheets("Url List")
                Lastrow = .Cells(.Rows.Count, "B").End(xlUp).Row
                Sheets("Url List").Range("B1").Value = Lastrow
            End With
            Call Autoclick_Click
        Next x
       .Quit
    End With
    
    'Hide FaceBook Scraper Form
    ScraperForm.Hide
   
End Sub
like image 249
Sharid Avatar asked May 25 '19 12:05

Sharid


1 Answers

Below is to show you how to handle finding email and website address. You already have your loop and de-duplicate. Below are helper methods for extracting the required info. You can simply assign from the variables email and website to your cells in loop. I show a method using a helper function to determine lastRow in target sheet and writing out variables to correct columns in one go.

I can help with implementing the loop integration if needed, but the emphasis here was on explaining what could be done for identifying those elements of interest and how to write out to correct columns. Tbh - de-duplicating is so easily done in sheet at end but you can also use macro recorder to get perfectly functional code for that single step/use existing SO answers.


tl;dr;

This would be a lot easier if :contains / :has css pseudo classes were permitted. Instead, my approach is as follows:

  1. email - find the href attribute whose value starts with mailto

image

  1. website - check that there is a website icon on the page

enter image description here

Specify a parent of both the website icon and the website address

enter image description here

Loop all matches to that parent specification checking if contains the website icon (this is where pseudo class selectors would have simplified things). If match found then we have the shared parent of both icon and hopefully website address; use childOfSiblingCssSelector (we are looking at a child of the following div in this case) css selector to extract the website url.


Notes:

  1. The entire thing is kept quite high level/generic such that you can adjust your css selectors to hopefully cater for different scenarios. Consequence - may seem a little verbose.
  2. Helper functions are provided to handle element matching. Name these in a way that makes sense for what they are doing. I think some room for improvement here.
  3. Whilst technically the second helper, GetText, could be used to extract the email address (I'd probably add another argument to function call to specify attribute to extract) as well as website address, it seems far quicker, currently, to simply target the appropriate href as detailed above.
  4. I have kept the css selectors as local variables close to their usage; you could have them as constants, closer to top of module, where easier to access perhaps? Unsure without seeing how this performs over time/different urls
  5. Css selectors are chosen over .getElementsBy methods for two reasons: 1) there is browser optimization for css selectors so, if well formulated, css will be faster 2) I want to preserve the flexibility of the code/helper functions - you have far more specificity with css selectors in terms of what patterns you can express. I deemed this important as I don't know what future cases you may need to handle.
  6. I am deliberately not using class name and index e.g. doc.getElementsByClassName("_50f4")(2).innerText, as I am unfamiliar with the range of potential use cases; this feels a little fragile as relies on consistent ordering and numbering of elements (at least up to these indices).

TODO:

  1. Rather than instantiate a new HTMLDocument each time in GetText, it is more efficient to pass another HTMLDocument argument in the function signature i.e. from calling procedure. A re-factor could take that into consideration.
  2. This type of coding might lend itself to being class based in the future. Particularly if error handling is to be added and further functions.

VBA:

Option Explicit
'VBE > Tools > References > HTML Object Library
Public Sub test()
    Dim ie As Object, ws As Worksheet
    Set ie = CreateObject("InternetExplorer.Application")
    Set ws = ThisWorkbook.Worksheets("Scraper")

    With ie
        .Visible = True
        .Navigate2 "https://www.facebook.com/pg/SalemFordNH/about/?ref=page_internal%5Blink%5D"

        While .Busy Or .readyState < 4: DoEvents: Wend

        With .document

            Dim email As String, website As String, iconCssSelector As String
            'iconCssSelector for website icon in this instance
            iconCssSelector = "[src='https://static.xx.fbcdn.net/rsrc.php/v3/yV/r/EaDvTjOwxIV.png']"

            If ElementIsPresent(ie.document, "[href^=mailto]") Then
                email = ie.document.querySelector("[href^=mailto]").innerText
            Else
                email = "Not found"
            End If

            Dim parents As Object, sharedParentCssSelector As String, childOfSiblingCssSelector As String
            sharedParentCssSelector = "._5aj7" 'target parent of both icon and the website link
            childOfSiblingCssSelector = "._50f4" '< to target website address after finding right parent

            If ElementIsPresent(ie.document, iconCssSelector) _
                And ElementIsPresent(ie.document, sharedParentCssSelector) Then

                Set parents = ie.document.querySelectorAll(sharedParentCssSelector) 'css selector used to allow for greater flexibility in element matching
                website = GetText(ie.document, parents, iconCssSelector, childOfSiblingCssSelector)
            Else
                website = "Not found"
            End If
        End With
        'Assumes headers already present
        Dim nextRow As Long
        nextRow = GetLastRow(ws, 1) + 1
        ws.Cells(nextRow, 1).Resize(1, 2) = Array(email, website)
        .Quit
    End With
End Sub

Public Function ElementIsPresent(ByVal document As HTMLDocument, ByVal cssSelector As String) As Boolean
    ElementIsPresent = document.querySelectorAll(cssSelector).length > 0
End Function

Public Function GetText(ByVal document As HTMLDocument, ByVal parents As Object, ByVal iconCssSelector As String, ByVal childOfSiblingCssSelector As String) As String
    'in this instance and with microsoft IE DOM you cannot select for parent of an element with pseudo class _
    of :has(>child); nor use :contains... instead pass expected parent selector, that houses _
    both the icon element for website and the website address itself, and loop all matches checking for website icon _
    if found use childOfSiblingCssSelector to extract
    Dim i As Long, html As HTMLDocument
    Set html = New HTMLDocument

    For i = 0 To parents.length - 1
        html.body.innerHTML = parents.item(i).innerHTML
        If ElementIsPresent(html, iconCssSelector) Then
            GetText = html.querySelector(childOfSiblingCssSelector).innerText
            Exit Function
        End If
    Next
    GetText = "Not found"
End Function

Public Function GetLastRow(ByVal ws As Worksheet, Optional ByVal columnNumber As Long = 1) As Long
    With ws
        GetLastRow = .Cells(.rows.Count, columnNumber).End(xlUp).Row
    End With
End Function

Project references (VBE > Tools > References):

  1. Microsoft HTML Object Library

Additional reading:

  1. https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors
  2. https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelectorAll
  3. https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelector

Edit:

Example of loop - assumes no empty rows in column A between urls.

Option Explicit
'VBE > Tools > References > HTML Object Library
Public Sub test()
    Dim ie As Object, ws As Worksheet, wsUrls As Worksheet, urls()
    Set ie = CreateObject("InternetExplorer.Application")
    Set ws = ThisWorkbook.Worksheets("Scraper")
    Set wsUrls = ThisWorkbook.Worksheets("Url List")

    With wsUrls
        urls = Application.Transpose(.Range("A2:A" & .Cells(.rows.Count, "A").End(xlUp).Row).Value)
    End With
    Dim results(), r As Long
    ReDim results(1 To UBound(urls), 1 To 2)

    With ie
        .Visible = True

        For r = LBound(urls) To UBound(urls)
            .Navigate2 urls(r)

            While .Busy Or .readyState < 4: DoEvents: Wend

            With .document

                Dim email As String, website As String, iconCssSelector As String
                'iconCssSelector for website icon in this instance
                iconCssSelector = "[src='https://static.xx.fbcdn.net/rsrc.php/v3/yV/r/EaDvTjOwxIV.png']"

                If ElementIsPresent(ie.document, "[href^=mailto]") Then
                    email = ie.document.querySelector("[href^=mailto]").innerText
                Else
                    email = "Not found"
                End If

                Dim parents As Object, sharedParentCssSelector As String, childOfSiblingCssSelector As String
                sharedParentCssSelector = "._5aj7" 'target parent of both icon and the website link
                childOfSiblingCssSelector = "._50f4" '< to target website address after finding right parent

                If ElementIsPresent(ie.document, iconCssSelector) _
        And ElementIsPresent(ie.document, sharedParentCssSelector) Then

                    Set parents = ie.document.querySelectorAll(sharedParentCssSelector) 'css selector used to allow for greater flexibility in element matching
                    website = GetText(ie.document, parents, iconCssSelector, childOfSiblingCssSelector)
                Else
                    website = "Not found"
                End If
            End With
            'Assumes headers already present
            Dim nextRow As Long
            results(r, 1) = email
            results(r, 2) = website
        Next
        .Quit
    End With
    nextRow = GetLastRow(ws, 1) + 1
    ws.Cells(nextRow, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
End Sub

Public Function ElementIsPresent(ByVal document As HTMLDocument, ByVal cssSelector As String) As Boolean
    ElementIsPresent = document.querySelectorAll(cssSelector).length > 0
End Function

Public Function GetText(ByVal document As HTMLDocument, ByVal parents As Object, ByVal iconCssSelector As String, ByVal childOfSiblingCssSelector As String) As String
    'in this instance and with microsoft IE DOM you cannot select for parent of an element with pseudo class _
    of :has(>child); nor use :contains... instead pass expected parent selector, that houses _
    both the icon element for website and the website address itself, and loop all matches checking for website icon _
    if found use childOfSiblingCssSelector to extract
    Dim i As Long, html As HTMLDocument
    Set html = New HTMLDocument

    For i = 0 To parents.length - 1
        html.body.innerHTML = parents.item(i).innerHTML
        If ElementIsPresent(html, iconCssSelector) Then
            GetText = html.querySelector(childOfSiblingCssSelector).innerText
            Exit Function
        End If
    Next
    GetText = "Not found"
End Function

Public Function GetLastRow(ByVal ws As Worksheet, Optional ByVal columnNumber As Long = 1) As Long
    With ws
        GetLastRow = .Cells(.rows.Count, columnNumber).End(xlUp).Row
    End With
End Function
like image 113
QHarr Avatar answered Nov 03 '22 17:11

QHarr