The code works fine, but I need it to extract ONLY emails and URLs and place the email in Sheet1
"Scraper" NEXT BLANK ROW
Emails = Column A
Urls = Column B
Currently it extracts anything text, emails or URL and places them in column A
or B
.
I only need Emails or URLs. I have been stuck on this for sometime and can't seem to work it out
Also I am not sure if my DELETE DUPLICATES is deleting duplicate rows or duplicates in column. It SHOULD be duplicate rows.
On Sheet2
"URL List" I have a list of URLs, the code runs through this and places the results onto Sheet1
"Scraper". and deletes any duplicates
It is only supposed to scraper email and URLs and place them in Column A
,B
on NEXT BLANK ROW.
I have tried to fix the problem but it is out of my scope.
Private Sub fbStart_Click()
'Set sheet2 URL List and open Internet Explorer
Dim lr As Long
Dim x As Long
Dim arr() As Variant
Dim wks As Worksheet
Dim ie As Object
Dim dd(1 To 2) As String
Dim Fr As Long
On Error Resume Next
Application.ScreenUpdating = False
Set wks = ThisWorkbook.Sheets("Url List")
With wks
Fr = .Cells(.Rows.Count, 6).End(xlUp).Offset(1).Row
lr = .Cells(.Rows.Count, 1).End(xlUp).Row
.Cells(1, 5).Value = lr
arr = .Range(.Cells(Fr, 1), .Cells(lr, 1)).Value
End With
'Show Internet Explorer and add delay in seconds if needed
Set ie = CreateObject("InternetExplorer.Application")
With ie
.Visible = True
Application.Wait Now + TimeValue("0:00:0")
For x = LBound(arr, 1) To UBound(arr, 1)
.navigate arr(x, 1)
wtime = Time
Do While .Busy Or .readyState <> 4
DoEvents
'Skip pages with Captchas + write the word Captcha in Sheet 2 Column C
If Time > (wtime + TimeValue("00:00:10")) Then
Cells(x + 1, "C").Value = "Captcha"
Exit Do
End If
Loop
On Error Resume Next
'Variable for document or data which need to be extracted out of webpage, change innertext number if same class used
Dim doc As HTMLDocument
Set doc = ie.document
dd(1) = doc.getElementsByClassName("_50f4")(2).innerText
dd(2) = doc.getElementsByClassName("_50f4")(3).innerText
'Paste the web data into Sheet1 "Scraper" in next BLANK ROW
With Sheet1
.Cells(.Rows.Count, "A").End(xlUp).Offset(1, 0).Resize(, 2).Value = dd
End With
' Put A number 1 in Sheet2 "Url List"column B to notify this URL is done
Sheets("Url List").Range("B" & Rows.Count).End(xlUp).Offset(1, 0).Value = 1
'Deletes duplicates in column A Sheet1
Columns(1).RemoveDuplicates Columns:=Array(1)
Columns(2).RemoveDuplicates Columns:=Array(1)
'Count No1 in sheet2 Column B
With Worksheets("Url List")
Lastrow = .Cells(.Rows.Count, "B").End(xlUp).Row
Sheets("Url List").Range("B1").Value = Lastrow
End With
Call Autoclick_Click
Next x
.Quit
End With
'Hide FaceBook Scraper Form
ScraperForm.Hide
End Sub
Below is to show you how to handle finding email and website address. You already have your loop and de-duplicate. Below are helper methods for extracting the required info. You can simply assign from the variables email
and website
to your cells in loop. I show a method using a helper function to determine lastRow in target sheet and writing out variables to correct columns in one go.
I can help with implementing the loop integration if needed, but the emphasis here was on explaining what could be done for identifying those elements of interest and how to write out to correct columns. Tbh - de-duplicating is so easily done in sheet at end but you can also use macro recorder to get perfectly functional code for that single step/use existing SO answers.
tl;dr;
This would be a lot easier if :contains
/ :has
css pseudo classes were permitted. Instead, my approach is as follows:
href
attribute whose value starts with mailto
Specify a parent of both the website icon and the website address
Loop all matches to that parent specification checking if contains the website icon (this is where pseudo class selectors would have simplified things). If match found then we have the shared parent of both icon and hopefully website address; use childOfSiblingCssSelector
(we are looking at a child of the following div in this case) css selector to extract the website url.
Notes:
GetText
, could be used to extract the email address (I'd probably add another argument to function call to specify attribute to extract) as well as website address, it seems far quicker, currently, to simply target the appropriate href
as detailed above..getElementsBy
methods for two reasons: 1) there is browser optimization for css selectors so, if well formulated, css will be faster 2) I want to preserve the flexibility of the code/helper functions - you have far more specificity with css selectors in terms of what patterns you can express. I deemed this important as I don't know what future cases you may need to handle.doc.getElementsByClassName("_50f4")(2).innerText
, as I am unfamiliar with the range of potential use cases; this feels a little fragile as relies on consistent ordering and numbering of elements (at least up to these indices).TODO:
HTMLDocument
each time in GetText
, it is more efficient to pass another HTMLDocument
argument in the function signature i.e. from calling procedure. A re-factor could take that into consideration.VBA:
Option Explicit
'VBE > Tools > References > HTML Object Library
Public Sub test()
Dim ie As Object, ws As Worksheet
Set ie = CreateObject("InternetExplorer.Application")
Set ws = ThisWorkbook.Worksheets("Scraper")
With ie
.Visible = True
.Navigate2 "https://www.facebook.com/pg/SalemFordNH/about/?ref=page_internal%5Blink%5D"
While .Busy Or .readyState < 4: DoEvents: Wend
With .document
Dim email As String, website As String, iconCssSelector As String
'iconCssSelector for website icon in this instance
iconCssSelector = "[src='https://static.xx.fbcdn.net/rsrc.php/v3/yV/r/EaDvTjOwxIV.png']"
If ElementIsPresent(ie.document, "[href^=mailto]") Then
email = ie.document.querySelector("[href^=mailto]").innerText
Else
email = "Not found"
End If
Dim parents As Object, sharedParentCssSelector As String, childOfSiblingCssSelector As String
sharedParentCssSelector = "._5aj7" 'target parent of both icon and the website link
childOfSiblingCssSelector = "._50f4" '< to target website address after finding right parent
If ElementIsPresent(ie.document, iconCssSelector) _
And ElementIsPresent(ie.document, sharedParentCssSelector) Then
Set parents = ie.document.querySelectorAll(sharedParentCssSelector) 'css selector used to allow for greater flexibility in element matching
website = GetText(ie.document, parents, iconCssSelector, childOfSiblingCssSelector)
Else
website = "Not found"
End If
End With
'Assumes headers already present
Dim nextRow As Long
nextRow = GetLastRow(ws, 1) + 1
ws.Cells(nextRow, 1).Resize(1, 2) = Array(email, website)
.Quit
End With
End Sub
Public Function ElementIsPresent(ByVal document As HTMLDocument, ByVal cssSelector As String) As Boolean
ElementIsPresent = document.querySelectorAll(cssSelector).length > 0
End Function
Public Function GetText(ByVal document As HTMLDocument, ByVal parents As Object, ByVal iconCssSelector As String, ByVal childOfSiblingCssSelector As String) As String
'in this instance and with microsoft IE DOM you cannot select for parent of an element with pseudo class _
of :has(>child); nor use :contains... instead pass expected parent selector, that houses _
both the icon element for website and the website address itself, and loop all matches checking for website icon _
if found use childOfSiblingCssSelector to extract
Dim i As Long, html As HTMLDocument
Set html = New HTMLDocument
For i = 0 To parents.length - 1
html.body.innerHTML = parents.item(i).innerHTML
If ElementIsPresent(html, iconCssSelector) Then
GetText = html.querySelector(childOfSiblingCssSelector).innerText
Exit Function
End If
Next
GetText = "Not found"
End Function
Public Function GetLastRow(ByVal ws As Worksheet, Optional ByVal columnNumber As Long = 1) As Long
With ws
GetLastRow = .Cells(.rows.Count, columnNumber).End(xlUp).Row
End With
End Function
Project references (VBE > Tools > References):
Additional reading:
Example of loop - assumes no empty rows in column A between urls.
Option Explicit
'VBE > Tools > References > HTML Object Library
Public Sub test()
Dim ie As Object, ws As Worksheet, wsUrls As Worksheet, urls()
Set ie = CreateObject("InternetExplorer.Application")
Set ws = ThisWorkbook.Worksheets("Scraper")
Set wsUrls = ThisWorkbook.Worksheets("Url List")
With wsUrls
urls = Application.Transpose(.Range("A2:A" & .Cells(.rows.Count, "A").End(xlUp).Row).Value)
End With
Dim results(), r As Long
ReDim results(1 To UBound(urls), 1 To 2)
With ie
.Visible = True
For r = LBound(urls) To UBound(urls)
.Navigate2 urls(r)
While .Busy Or .readyState < 4: DoEvents: Wend
With .document
Dim email As String, website As String, iconCssSelector As String
'iconCssSelector for website icon in this instance
iconCssSelector = "[src='https://static.xx.fbcdn.net/rsrc.php/v3/yV/r/EaDvTjOwxIV.png']"
If ElementIsPresent(ie.document, "[href^=mailto]") Then
email = ie.document.querySelector("[href^=mailto]").innerText
Else
email = "Not found"
End If
Dim parents As Object, sharedParentCssSelector As String, childOfSiblingCssSelector As String
sharedParentCssSelector = "._5aj7" 'target parent of both icon and the website link
childOfSiblingCssSelector = "._50f4" '< to target website address after finding right parent
If ElementIsPresent(ie.document, iconCssSelector) _
And ElementIsPresent(ie.document, sharedParentCssSelector) Then
Set parents = ie.document.querySelectorAll(sharedParentCssSelector) 'css selector used to allow for greater flexibility in element matching
website = GetText(ie.document, parents, iconCssSelector, childOfSiblingCssSelector)
Else
website = "Not found"
End If
End With
'Assumes headers already present
Dim nextRow As Long
results(r, 1) = email
results(r, 2) = website
Next
.Quit
End With
nextRow = GetLastRow(ws, 1) + 1
ws.Cells(nextRow, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
End Sub
Public Function ElementIsPresent(ByVal document As HTMLDocument, ByVal cssSelector As String) As Boolean
ElementIsPresent = document.querySelectorAll(cssSelector).length > 0
End Function
Public Function GetText(ByVal document As HTMLDocument, ByVal parents As Object, ByVal iconCssSelector As String, ByVal childOfSiblingCssSelector As String) As String
'in this instance and with microsoft IE DOM you cannot select for parent of an element with pseudo class _
of :has(>child); nor use :contains... instead pass expected parent selector, that houses _
both the icon element for website and the website address itself, and loop all matches checking for website icon _
if found use childOfSiblingCssSelector to extract
Dim i As Long, html As HTMLDocument
Set html = New HTMLDocument
For i = 0 To parents.length - 1
html.body.innerHTML = parents.item(i).innerHTML
If ElementIsPresent(html, iconCssSelector) Then
GetText = html.querySelector(childOfSiblingCssSelector).innerText
Exit Function
End If
Next
GetText = "Not found"
End Function
Public Function GetLastRow(ByVal ws As Worksheet, Optional ByVal columnNumber As Long = 1) As Long
With ws
GetLastRow = .Cells(.rows.Count, columnNumber).End(xlUp).Row
End With
End Function
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With