I've written a script in vba to scrape different movie names
and their genre
from a torrent site. Although the name
and genre
are present in it's landing page, I created the script to parse the same going one layer deep (from their main pages). To be clearer, this is one of such page what I meant by main page. My script is parsing them flawlessly. However, my intention is to do the same making asynchronous requests. Currently the script is doing it's job synchronously (in blocking manner).
In my previous post I got an answer from omegastripes
who created a script (which more or less performs like how multiprocessing works
) meant to be working asynchronously
. So that is where I found the idea but can't implement the same within the following script.
My attempt so far:
Sub GetInfo()
Const URL = "https://yts.am/browse-movies"
Dim Http As New ServerXMLHTTP60, Html As New HTMLDocument
Dim post As HTMLDivElement, oName$, oGenre$, R&
Dim I&, key As Variant, iDic As Object
Set iDic = CreateObject("Scripting.Dictionary")
With Http
.Open "GET", URL, False
.send
Html.body.innerHTML = .responseText
End With
With Html.querySelectorAll(".browse-movie-wrap .browse-movie-title")
For I = 0 To .Length - 1
iDic(.Item(I).getAttribute("href")) = 1
Next I
End With
For Each key In iDic.keys
With Http
.Open "GET", key, False
.send
Html.body.innerHTML = .responseText
End With
oName = Html.querySelector("h1").innerText
oGenre = Html.querySelector("h2").NextSibling.innerText
R = R + 1: Cells(R, 1) = oName
Cells(R, 2) = oGenre
Next key
End Sub
How can I bring about any change within my above script in order for it to be working asynchronously
?
Here is the example showing single loop parser implementation with async requests pool. The code parses all Browse Pages from the first to last and Movies Pages, both types are parsed simultaneously. Movies URLs are parsed from Browse Pages and placed in Movies Queue, then details from each Movie Page from the queue is parsed and output to the worksheet. It handles all HTTP requests errors types and makes retry until the limit.
Put the below code to a standard module:
Option Explicit
Sub Test()
Const PoolCapacity = 30 ' Async requests qty
Const MoviesMin = 55 ' Movies in queue + expected movies min qty to request new browse page
Const ReqDelayMin = 0.15 ' Min delay between requests to avoid ban, sec
Const ReqTimeout = 15 ' Request timeout, sec
Const ReqRetryMax = 3 ' Attempts for each request before quit
Dim oWS As Worksheet
Dim y As Long
Dim ocPool As Collection
Dim ocMovies As Collection
Dim lMoviesPerPage As Long
Dim lBPageIndex As Long
Dim lBPagesInPoolQty As Long
Dim bLastBPageReached As Boolean
Dim dPrevReqSent As Double
Dim i As Long
Dim bBPageInPool As Boolean
Dim dT As Double
Dim bFail As Boolean
Dim sResp As String
Dim oMatches As Object
Dim oMatch
Dim oReq As Object
Dim oRequest As cRequest
' Prepare worksheet
Set oWS = ThisWorkbook.Sheets(1)
oWS.Cells.Delete
y = 1
' Init
Set ocPool = New Collection ' Requests async pool
Set ocMovies = New Collection ' Movies urls queue
lMoviesPerPage = 20 ' Movies per page qty
lBPageIndex = 1 ' Current browse page index for request
bLastBPageReached = False ' Last page reached flag
dPrevReqSent = -60# * 60# * 24# ' Init delay timer
' Start parsing
Do
lBPagesInPoolQty = 0 ' How many browse pages currently in pool
' Check pool for all flagged and completed requests
For i = ocPool.Count To 1 Step -1
bBPageInPool = Not ocPool(i).IsMovie
' Delay from last request
dT = Timer - dPrevReqSent
If dT < 0 Then dT = dT + 60# * 60# * 24#
Select Case True
' Check request has no sent flag
Case Not ocPool(i).NeedSend
On Error Resume Next
bFail = False
sResp = ""
With ocPool(i).HTTPRequest
' Check http request is ready and status is OK
Select Case True
Case .ReadyState < 4 ' Not ready
Case .Status \ 100 <> 2 ' Wrong status
Debug.Print .Status & " / " & .StatusText & " (" & ocPool(i).URL & ")"
bFail = True
Case Else ' Ready and OK
sResp = .ResponseText
End Select
End With
If sResp = "" Then
' Request elapsed time
dT = Timer - ocPool(i).SendTimer
If dT < 0 Then dT = dT + 60# * 60# * 24#
' Check request is failed
Select Case True
Case Err.Number <> 0 ' Runtime error
Debug.Print Err.Number & " / " & Err.Description & " (" & ocPool(i).URL & ")"
bFail = True
Case dT > ReqTimeout ' Timeout
Debug.Print "Timeout (" & ocPool(i).URL & ")"
bFail = True
End Select
On Error GoTo 0
If bFail Then ' Request has been failed
ocPool(i).FailsCount = ocPool(i).FailsCount + 1
' Check attempts
If ocPool(i).FailsCount > ReqRetryMax Then
Debug.Print "Quit (" & ocPool(i).URL & ")"
ocPool.Remove i ' Quit
bBPageInPool = False
Else
ocPool(i).NeedSend = True ' Raise send flag to retry
End If
End If
Else ' Response received
If ocPool(i).IsMovie Then
' Response from movie page
With CreateObject("VBScript.RegExp")
' Parse Title, Year, Genre
' <h1 itemprop\="name">___</h1>\s*<h2>___</h2>\s*<h2>___</h2>
.Pattern = "<h1 itemprop\=""name"">([^<]*)</h1>\s*<h2>([^<]*)</h2>\s*<h2>([^<]*)</h2>"
Set oMatches = .Execute(sResp)
If oMatches.Count = 1 Then ' Output to worksheet
oWS.Cells(y, 1).Value = oMatches(0).SubMatches(0)
oWS.Cells(y, 2).Value = oMatches(0).SubMatches(1)
oWS.Cells(y, 3).Value = oMatches(0).SubMatches(2)
y = y + 1
End If
End With
Else
' Response from browse page
With CreateObject("VBScript.RegExp")
.Global = True
' Parse movies urls
' <a href="___" class="browse-movie-link">
.Pattern = "<a href=""([^""]*)"" class=""browse-movie-link"">"
Set oMatches = .Execute(sResp)
For Each oMatch In oMatches
ocMovies.Add oMatch.SubMatches(0) ' Movies queue fed
Next
' Parse next page button
' <a href="/browse-movies?page=___">Next
.Pattern = "<a href\=""/browse-movies\?page\=\d+"">Next "
bLastBPageReached = bLastBPageReached Or Not .Test(sResp)
End With
If Not bLastBPageReached Then lMoviesPerPage = oMatches.Count ' Update lMoviesPerPage
End If
ocPool.Remove i
bBPageInPool = False
End If
' Check request has send flag raised and delay enough
Case dT > ReqDelayMin
' Send the request
Set oReq = CreateObject("MSXML2.ServerXMLHTTP.6.0")
With oReq
.Open "GET", ocPool(i).URL, True
' .SetProxy 2, "190.12.55.210:46078"
.SetRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64)"
.Send
End With
ocPool(i).NeedSend = False
ocPool(i).SendTimer = Timer
dPrevReqSent = ocPool(i).SendTimer
Set ocPool(i).HTTPRequest = oReq
End Select
If bBPageInPool Then lBPagesInPoolQty = lBPagesInPoolQty + 1
DoEvents
Next
' Check if there is a room for a new request in pool
If ocPool.Count < PoolCapacity Then
' Add one new request to pool
' Check if movies in queue + expected movies are not enough
If ocMovies.Count + lBPagesInPoolQty * lMoviesPerPage < MoviesMin And Not bLastBPageReached Then
' Add new request for next browse page to feed movie queue
Set oRequest = New cRequest
With oRequest
.URL = "https://yts.am/browse-movies?page=" & lBPageIndex
.IsMovie = False
.NeedSend = True
.FailsCount = 0
End With
ocPool.Add oRequest
lBPageIndex = lBPageIndex + 1
Else
' Check if movie page urls are parsed and available in queue
If ocMovies.Count > 0 Then
' Add new request for next movie page from queue
Set oRequest = New cRequest
With oRequest
.URL = ocMovies(1)
.IsMovie = True
.NeedSend = True
.FailsCount = 0
End With
ocPool.Add oRequest
ocMovies.Remove 1
End If
End If
End If
DoEvents
Loop While ocPool.Count > 0 ' Loop until the last request completed
MsgBox "Completed"
End Sub
Put the below code to a class module named cRequest
:
Public URL As String
Public IsMovie As Boolean
Public NeedSend As Boolean
Public SendTimer As Double
Public HTTPRequest As Object
Public FailsCount As Long
Reduce delay between requests Const ReqDelayMin
with care. Once launched with a high rate for me it worked for a while and caused Cloudflare DDoS protection to trigger, and currently, I'm unable to make the code work directly from my IP, the only way is to use a proxy for the requests (you can see the commented line with .SetProxy
). Even in Chrome, I'm getting Cloudflare redirection now:
Thus the approach just reveals the question, however, the safest and much more efficient way is to use the website API as described in this answer.
This code should do the trick. It uses a MSXML2.XMLHTTP
object to handle the request.
This is the Module
code to get info:
Sub GetInfo()
On Error GoTo FailedState
If Not xmlHttpRequest Is Nothing Then Set xmlHttpRequest = Nothing
Dim MyXmlHttpHandler As CXMLHTTPHandler
Dim url As String
url = "https://yts.am/browse-movies"
Set xmlHttpRequest = New MSXML2.XMLHTTP
' Create an instance of the wrapper class.
Set MyXmlHttpHandler = New CXMLHTTPHandler
MyXmlHttpHandler.Initialize xmlHttpRequest
' Assign the wrapper class object to onreadystatechange.
xmlHttpRequest.OnReadyStateChange = MyXmlHttpHandler
' Get the page stuff asynchronously.
xmlHttpRequest.Open "GET", url, True
xmlHttpRequest.send ""
Exit Sub
FailedState:
MsgBox Err.Number & ": " & Err.Description
End Sub
And this is the class
CXMLHTTPHandler that handles the response asynchronously:
Option Explicit
Dim m_xmlHttp As MSXML2.XMLHTTP60
Public Sub Initialize(ByRef xmlHttpRequest As MSXML2.XMLHTTP60)
Set m_xmlHttp = xmlHttpRequest
End Sub
Sub OnReadyStateChange()
Debug.Print m_xmlHttp.readyState
If m_xmlHttp.readyState = 4 Then
'Now the page is loaded
'insert here your code to process the response
MsgBox m_xmlHttp.responseText 'i.e. print the response
End If
End Sub
If you want more details, look here.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With