Using: Delphi 2010, latest version of Indy
I am trying to scrape data from Google's AdSense web page in order to get the reports, but I have been unsuccessful so far: it stops after the first request and does not proceed.
Using Fiddler to debug the traffic to the Google AdSense website while loading the AdSense page in a web browser, I can see that the browser's request generates a number of redirects before the page is loaded.
However, my Delphi application is only generating a couple of requests before it stops.
Here are the steps I have followed:
Finally I have this code:
procedure TfmMain.GetUrlToFile(AURL, AFile : String);
var
  Output : TMemoryStream;
begin
  Output := TMemoryStream.Create;
  try
    // Note: this passes FURL, not the AURL parameter.
    IdHTTP1.Get(FURL, Output);
    Output.SaveToFile(AFile);
  finally
    Output.Free;
  end;
end;
However, it does not get to the login page as expected. I would expect it to behave like a web browser and follow the redirects until it reaches the final page.
This is the output of the headers from Fiddler:
HTTP/1.1 302 Found
Location: https://encrypted.google.com/
Cache-Control: private
Content-Type: text/html; charset=UTF-8
Set-Cookie: PREF=ID=5166063f01b64b03:FF=0:TM=1293571783:LM=1293571783:S=a5OtsOqxu_GiV3d6; expires=Thu, 27-Dec-2012 21:29:43 GMT; path=/; domain=.google.com
Set-Cookie: NID=42=XFUwZdkyF0TJKmoJjqoGgYNtGyOz-Irvz7ivao2z0--pCBKPpAvCGUeaa5GXLneP41wlpse-yU5UuC57pBfMkv434t7XB1H68ET0ZgVDNEPNmIVEQRVj7AA1Lnvv2Aez; expires=Wed, 29-Jun-2011 21:29:43 GMT; path=/; domain=.google.com; HttpOnly
Date: Tue, 28 Dec 2010 21:29:43 GMT
Server: gws
Content-Length: 226
X-XSS-Protection: 1; mode=block
Firstly, is there anything wrong with this output?
Is there something more that I should do to get the IdHTTP component to keep pursuing the redirects until the final page?
IdHTTP component property values prior to making the call:
Name := 'IdHTTP1';
IOHandler := IdSSLIOHandlerSocketOpenSSL1;
AllowCookies := True;
HandleRedirects := True;
RedirectMaximum := 35;
Request.UserAgent :=
'Mozilla/5.0 (Windows NT 5.1; rv:2.0b8) Gecko/20100101 Firefox/4.' +
'0b8';
HTTPOptions := [hoForceEncodeParams];
OnRedirect := IdHTTP1Redirect;
CookieManager := IdCookieManager1;
Redirect event handler:
procedure TfmMain.IdHTTP1Redirect(Sender: TObject; var dest: string;
  var NumRedirect: Integer; var Handled: Boolean; var VMethod: string);
begin
  Handled := True;
end;
Making the call:
FURL := 'https://www.google.com';
GetUrlToFile( (FURL + '/adsense/'), 'a.html');
procedure TfmMain.GetUrlToFile(AURL, AFile : String);
var
  Output : TMemoryStream;
begin
  Output := TMemoryStream.Create;
  try
    try
      IdHTTP1.Get(AURL, Output);
      IdHTTP1.Disconnect;
    except
      // Any exception raised by Get is silently swallowed here.
    end;
    Output.SaveToFile(AFile);
  finally
    Output.Free;
  end;
end;
Here's the (request and response headers) output from Fiddler:
Set TIdHTTP.HandleRedirects := True so that it starts handling redirects automatically. TIdHTTP.RedirectMaximum sets how many successive redirects should be handled. Alternatively, you may assign TIdHTTP.OnRedirect and set Handled := True from that handler. This is what I'm doing in a project that needs to read data from a WikiMedia web site (my own site).
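The settings described above can be sketched as a small console program. This is only an illustration under stated assumptions: the class name TRedirectLogger, the limit of 15 redirects, and the output file name are made up for the example, the handler signature is copied from the question and is assumed to match the installed Indy version, and the OpenSSL DLLs are assumed to be available at run time. It prints each redirect hop and any exception, which may help to see where the chain stops.

```delphi
program RedirectSketch;

{$APPTYPE CONSOLE}

uses
  Classes, SysUtils, IdHTTP, IdSSLOpenSSL;

type
  // Hypothetical helper class: event handlers must be methods of an object.
  TRedirectLogger = class
  public
    procedure HandleRedirect(Sender: TObject; var dest: string;
      var NumRedirect: Integer; var Handled: Boolean; var VMethod: string);
  end;

procedure TRedirectLogger.HandleRedirect(Sender: TObject; var dest: string;
  var NumRedirect: Integer; var Handled: Boolean; var VMethod: string);
begin
  // Print each hop so it is visible where the chain stops.
  Writeln(Format('Redirect %d (%s) -> %s', [NumRedirect, VMethod, dest]));
  Handled := True; // tell TIdHTTP to follow this redirect
end;

var
  Http: TIdHTTP;
  Logger: TRedirectLogger;
  Output: TMemoryStream;
begin
  Logger := TRedirectLogger.Create;
  Http := TIdHTTP.Create(nil);
  Output := TMemoryStream.Create;
  try
    // The SSL handler is owned by Http and is freed together with it.
    Http.IOHandler := TIdSSLIOHandlerSocketOpenSSL.Create(Http);
    Http.HandleRedirects := True;
    Http.RedirectMaximum := 15; // arbitrary limit chosen for this sketch
    Http.OnRedirect := Logger.HandleRedirect;
    try
      Http.Get('https://www.google.com/adsense/', Output);
      Output.SaveToFile('a.html');
    except
      // Report the failure instead of hiding it.
      on E: Exception do
        Writeln(E.ClassName, ': ', E.Message);
    end;
  finally
    Output.Free;
    Http.Free;
    Logger.Free;
  end;
end.
```

Reporting the exception rather than swallowing it matters here, because a failed request otherwise looks the same as a chain of redirects that simply ended early.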
There is nothing wrong with that response; it is a very basic redirect to https://encrypted.google.com/ that also sets some cookies. TIdHTTP should go to the given page in response.
Don't forget to assign a CookieManager, and make sure you use the same CookieManager for all subsequent requests. If you don't, you'll probably get redirected to the login page over and over again.
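As a rough sketch of the cookie advice, the program below creates one TIdCookieManager and keeps it attached to the same TIdHTTP instance for two requests. The program name and the choice to request the same URL twice are assumptions made for illustration, and the OpenSSL DLLs are again assumed to be present.

```delphi
program SharedCookiesSketch;

{$APPTYPE CONSOLE}

uses
  Classes, SysUtils, IdHTTP, IdSSLOpenSSL, IdCookieManager;

var
  Http: TIdHTTP;
  Cookies: TIdCookieManager;
  Page: string;
begin
  Cookies := TIdCookieManager.Create(nil);
  Http := TIdHTTP.Create(nil);
  try
    // The SSL handler is owned by Http and is freed together with it.
    Http.IOHandler := TIdSSLIOHandlerSocketOpenSSL.Create(Http);
    Http.HandleRedirects := True;
    Http.AllowCookies := True;
    // One cookie manager, assigned once and reused for every request.
    Http.CookieManager := Cookies;

    // First request: cookies from Set-Cookie headers land in Cookies.
    Page := Http.Get('https://www.google.com/adsense/');
    Writeln('Cookies stored: ', Cookies.CookieCollection.Count);

    // Second request: the stored cookies are sent back automatically.
    Page := Http.Get('https://www.google.com/adsense/');
    Writeln('Second response length: ', Length(Page));
  finally
    Http.Free;
    Cookies.Free;
  end;
end.
```

If several TIdHTTP components are used, assigning this same Cookies instance to each of them should keep the session consistent across all of them.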