I'm writing a web crawler for a specific site. The application is a VB.Net Windows Forms application that does not use multiple threads - each web request is made consecutively. However, after ten successful page retrievals, every subsequent request times out.
I have reviewed the similar questions already posted here on SO, and have implemented the recommended techniques into my GetPage routine, shown below:
Public Function GetPage(ByVal url As String) As String
    Dim result As String = String.Empty
    Dim uri As New Uri(url)
    Dim sp As ServicePoint = ServicePointManager.FindServicePoint(uri)
    sp.ConnectionLimit = 100

    ' Requires Imports System.Net and Imports System.IO.
    Dim request As HttpWebRequest = DirectCast(WebRequest.Create(uri), HttpWebRequest)
    request.KeepAlive = False
    request.Timeout = 15000

    Try
        Using response As HttpWebResponse = DirectCast(request.GetResponse(), HttpWebResponse)
            Using dataStream As Stream = response.GetResponseStream()
                Using reader As New StreamReader(dataStream)
                    If response.StatusCode <> HttpStatusCode.OK Then
                        Throw New Exception("Got response status code: " & response.StatusCode.ToString())
                    End If
                    result = reader.ReadToEnd()
                End Using
            End Using
            response.Close()
        End Using
    Catch ex As Exception
        Dim msg As String = "Error reading page """ & url & """. " & ex.Message
        Logger.LogMessage(msg, LogOutputLevel.Diagnostics)
    End Try

    Return result
End Function
Have I missed something? Am I not closing or disposing of an object that should be? It seems strange that it always happens after ten consecutive requests.
Notes:
In the constructor for the class in which this method resides I have the following:
ServicePointManager.DefaultConnectionLimit = 100
If I set KeepAlive to true, the timeouts begin after five requests.
All the requests are for pages in the same domain.
EDIT
I added a delay of between two and seven seconds between each web request so that I do not appear to be "hammering" the site or attempting a DoS attack. However, the problem still occurs.
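A minimal sketch of what such a randomized delay might look like in the crawl loop (the loop, the urlsToCrawl collection, and the rng field are hypothetical; only GetPage comes from the code above):

Private ReadOnly rng As New Random()

For Each pageUrl As String In urlsToCrawl
    Dim html As String = GetPage(pageUrl)
    ' Pause for a random interval between 2 and 7 seconds before the next request.
    System.Threading.Thread.Sleep(rng.Next(2000, 7001))
Next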
I ran into this issue today, and my resolution was to ensure that the response was closed in all cases. I think you need to put a response.Close() before you throw your exception inside the Using block:
Using response As HttpWebResponse = DirectCast(request.GetResponse(), HttpWebResponse)
    Using dataStream As Stream = response.GetResponseStream()
        Using reader As New StreamReader(dataStream)
            If response.StatusCode <> HttpStatusCode.OK Then
                response.Close()
                Throw New Exception("Got response status code: " & response.StatusCode.ToString())
            End If
            result = reader.ReadToEnd()
        End Using
    End Using
    response.Close()
End Using
I think the site has some sort of DoS protection, which kicks in when it's hit with a number of rapid requests. You may want to try setting the UserAgent on the WebRequest.
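For example, a minimal sketch of setting the user agent, reusing the request setup from the GetPage routine above (the user-agent string itself is arbitrary):

Dim request As HttpWebRequest = DirectCast(WebRequest.Create(uri), HttpWebRequest)
' Hypothetical user-agent string; some sites throttle or reject requests that send none.
request.UserAgent = "Mozilla/5.0 (compatible; MyCrawler/1.0)"
request.KeepAlive = False
request.Timeout = 15000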
I used the following solution and it works for me; I hope it helps you too. Declare the variables at form level (C#):
HttpWebRequest myHttpWebRequest;
HttpWebResponse myHttpWebResponse;
Then always call myHttpWebResponse.Close() after each connection:
myHttpWebResponse = (HttpWebResponse)myHttpWebRequest.GetResponse();
// ... read the response stream here ...
myHttpWebResponse.Close();