Indy - IdHttp how to handle page redirects?

Using: Delphi 2010, latest version of Indy

I am trying to scrape data from Google's AdSense web pages in order to retrieve the reports, but so far without success: my application stops after the first request and does not proceed.

Using Fiddler to debug the traffic to the Google AdSense website while loading the AdSense page in a web browser, I can see that the browser's request goes through a number of redirects before the page loads.

However, my Delphi application is only generating a couple of requests before it stops.

Here are the steps I have followed:

  1. Drop a TIdHTTP and a TIdSSLIOHandlerSocketOpenSSL component on the form (IdHTTP1 and IdSSLIOHandlerSocketOpenSSL1).
  2. Set the IdHTTP1 properties AllowCookies and HandleRedirects to True, and its IOHandler property to IdSSLIOHandlerSocketOpenSSL1.
  3. Set the IdSSLIOHandlerSocketOpenSSL1 property SSLOptions.Method to sslvSSLv23.

Finally I have this code:

procedure TfmMain.GetUrlToFile(AURL, AFile : String);
var
 Output : TMemoryStream;
begin
  Output := TMemoryStream.Create;
  try
    IdHTTP1.Get(FURL, Output);
    Output.SaveToFile(AFile);
  finally
    Output.Free;
  end;
end;

However, it does not reach the login page as expected. I would expect it to behave like a web browser and follow the redirects until it reaches the final page.

This is the output of the headers from Fiddler:

HTTP/1.1 302 Found
Location: https://encrypted.google.com/
Cache-Control: private
Content-Type: text/html; charset=UTF-8
Set-Cookie: PREF=ID=5166063f01b64b03:FF=0:TM=1293571783:LM=1293571783:S=a5OtsOqxu_GiV3d6; expires=Thu, 27-Dec-2012 21:29:43 GMT; path=/; domain=.google.com
Set-Cookie: NID=42=XFUwZdkyF0TJKmoJjqoGgYNtGyOz-Irvz7ivao2z0--pCBKPpAvCGUeaa5GXLneP41wlpse-yU5UuC57pBfMkv434t7XB1H68ET0ZgVDNEPNmIVEQRVj7AA1Lnvv2Aez; expires=Wed, 29-Jun-2011 21:29:43 GMT; path=/; domain=.google.com; HttpOnly
Date: Tue, 28 Dec 2010 21:29:43 GMT
Server: gws
Content-Length: 226
X-XSS-Protection: 1; mode=block

Firstly, is there anything wrong with this output?

Is there something more that I should do to get the IdHTTP component to keep pursuing the redirects until the final page?

SteveL asked Dec 28 '10 21:12


2 Answers

IdHTTP component property values prior to making the call:

    Name := 'IdHTTP1';
    IOHandler := IdSSLIOHandlerSocketOpenSSL1;
    AllowCookies := True;
    HandleRedirects := True;
    RedirectMaximum := 35;
    Request.UserAgent := 'Mozilla/5.0 (Windows NT 5.1; rv:2.0b8) Gecko/20100101 Firefox/4.0b8';
    HTTPOptions := [hoForceEncodeParams];
    OnRedirect := IdHTTP1Redirect;
    CookieManager := IdCookieManager1;

Redirect event handler:

procedure TfmMain.IdHTTP1Redirect(Sender: TObject; var dest: string; var
    NumRedirect: Integer; var Handled: Boolean; var VMethod: string);
begin
   Handled := True;
end;

Making the call:

  FURL := 'https://www.google.com';

  GetUrlToFile( (FURL + '/adsense/'), 'a.html');

  procedure TfmMain.GetUrlToFile(AURL, AFile : String);
  var
    Output : TMemoryStream;
  begin
    Output := TMemoryStream.Create;
    try
      try
        IdHTTP1.Get(AURL, Output);
        IdHTTP1.Disconnect;
      except
        // Any exception is swallowed here, so whatever was received
        // before a failure is still written to the file below.
      end;
      Output.SaveToFile(AFile);
    finally
      Output.Free;
    end;
  end;

Here are the request and response headers captured by Fiddler:

[Image: Fiddler capture of the request and response headers]

SteveL answered Oct 08 '22 17:10


Getting redirects going

Set TIdHTTP.HandleRedirects := True so that the component follows redirects automatically.

TIdHTTP.RedirectMaximum sets how many successive redirects will be followed.


Alternatively, you may assign TIdHTTP.OnRedirect and set Handled := True from that handler. This is what I'm doing in a project that needs to read data from a WikiMedia web site (my own site).
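
The settings above could be combined roughly as follows. This is only a sketch, not code from a real project: the program name, the TRedirectLogger helper class, and the RedirectMaximum value of 15 are illustrative assumptions, and the handler signature (with VMethod declared as string) is copied from the handler posted in this thread, so it may differ in other Indy versions. It also assumes the OpenSSL DLLs that Indy needs are available at run time.

```pascal
program RedirectSketch;  // hypothetical name, for illustration only
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes, IdHTTP, IdSSLOpenSSL;

type
  // Event handlers must be methods of an object, so a small helper
  // class (an assumption of this sketch) hosts the OnRedirect handler.
  TRedirectLogger = class
    procedure Redirected(Sender: TObject; var dest: string;
      var NumRedirect: Integer; var Handled: Boolean; var VMethod: string);
  end;

procedure TRedirectLogger.Redirected(Sender: TObject; var dest: string;
  var NumRedirect: Integer; var Handled: Boolean; var VMethod: string);
begin
  // Print every hop so it is visible where the redirect chain stops.
  Writeln(Format('Redirect %d -> %s', [NumRedirect, dest]));
  Handled := True;  // tell TIdHTTP to follow this redirect
end;

var
  HTTP: TIdHTTP;
  SSL: TIdSSLIOHandlerSocketOpenSSL;
  Logger: TRedirectLogger;
  Output: TMemoryStream;
begin
  HTTP := TIdHTTP.Create(nil);
  SSL := TIdSSLIOHandlerSocketOpenSSL.Create(HTTP);  // owned by HTTP, freed with it
  Logger := TRedirectLogger.Create;
  Output := TMemoryStream.Create;
  try
    SSL.SSLOptions.Method := sslvSSLv23;
    HTTP.IOHandler := SSL;
    HTTP.HandleRedirects := True;   // follow redirects automatically
    HTTP.RedirectMaximum := 15;     // illustrative limit on successive redirects
    HTTP.OnRedirect := Logger.Redirected;
    try
      HTTP.Get('https://www.google.com/adsense/', Output);
      Writeln('Final response code: ', HTTP.ResponseCode);
    except
      on E: Exception do
        Writeln(E.ClassName, ': ', E.Message);  // report failures instead of hiding them
    end;
  finally
    Output.Free;
    Logger.Free;
    HTTP.Free;
  end;
end.
```

Logging each hop in the handler makes it easier to compare the chain TIdHTTP follows with the chain Fiddler shows for the browser.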

About the HTTP response

There is nothing wrong with that response: it is a very basic redirect to https://encrypted.google.com/, and TIdHTTP should follow it to the given page. It also sets some cookies.

Other suggestions

Don't forget to assign a CookieManager, and make sure you use the same CookieManager for all subsequent requests. If you don't, you'll probably get redirected to the login page over and over again.
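
A minimal sketch of sharing one cookie manager across requests, assuming plain-HTTP placeholder URLs (the example.com addresses and the program name are hypothetical and stand in for the real login and report pages):

```pascal
program SharedCookiesSketch;  // hypothetical name, for illustration only
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes, IdHTTP, IdCookieManager;

var
  HTTP: TIdHTTP;
  Cookies: TIdCookieManager;
  Page: string;
begin
  Cookies := TIdCookieManager.Create(nil);
  HTTP := TIdHTTP.Create(nil);
  try
    HTTP.AllowCookies := True;
    HTTP.CookieManager := Cookies;  // the same manager serves every request below
    HTTP.HandleRedirects := True;
    try
      // Placeholder URLs: cookies set by the first response are sent
      // again with the second request because the manager is shared.
      Page := HTTP.Get('http://example.com/login');
      Page := HTTP.Get('http://example.com/reports');
      Writeln('Received ', Length(Page), ' characters');
    except
      on E: Exception do
        Writeln(E.ClassName, ': ', E.Message);
    end;
  finally
    HTTP.Free;
    Cookies.Free;
  end;
end.
```

If several TIdHTTP instances are used for one session, assigning the same TIdCookieManager to each of them should have the same effect.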

Cosmin Prund answered Oct 08 '22 18:10