
How can I programmatically perform a search without using an API?

I would like to create a program that will enter a string into the text box on a site like Google (without using their public API) and then submit the form and grab the results. Is this possible? Grabbing the results will require the use of HTML scraping I would assume, but how would I enter data into the text field and submit the form? Would I be forced to use a public API? Is something like this just not feasible? Would I have to figure out query strings/parameters?

Thanks

kgrad asked Jul 17 '09 01:07


4 Answers

Theory

What I would do is create a little program that can automatically submit any form data to any place and come back with the results. This is easy to do in Java with HTTPUnit. The task goes like this:

  • Connect to the web server.
  • Parse the page.
  • Get the first form on the page.
  • Fill in the form data.
  • Submit the form.
  • Read (and parse) the results.

The solution you pick will depend on a variety of factors, including:

  • Whether you need to emulate JavaScript
  • What you need to do with the data afterwards
  • Which languages you are proficient in
  • Application speed (is this for one query or 100,000?)
  • How soon the application needs to be working
  • Whether it is a one-off or will have to be maintained

For example, you could try the following applications to submit the data for you:

  • Lynx
  • curl
  • wget

Then grep (awk, or sed) the resulting web page(s).
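
If the page is already being fetched from Java, the same grep-style pass can stay in Java. Here is a minimal sketch, assuming the page is held in a string and that the links of interest use double-quoted href attributes. The class name, method name and pattern are illustrative only, and a regex like this is adequate only for simple, predictable markup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkGrep {
  // Matches href="..." with double-quoted values only (an assumption
  // about the markup, not a general HTML parser).
  private static final Pattern HREF =
    Pattern.compile( "href=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE );

  // Returns every href value found in the page, in document order.
  static List<String> extractLinks( String html ) {
    List<String> links = new ArrayList<>();
    Matcher m = HREF.matcher( html );

    while( m.find() )
      links.add( m.group( 1 ) );

    return links;
  }

  public static void main( String[] args ) {
    String html = "<a href=\"/one\">1</a> <a HREF=\"/two\">2</a>";
    System.out.println( extractLinks( html ) );
  }
}
```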

Another trick when screen scraping is to download a sample HTML file and parse it manually in vi (or Vim). Save the keystrokes to a file and then, whenever you run the query, apply those keystrokes to the resulting web page(s) to extract the data. This solution is neither maintainable nor 100% reliable (but screen scraping from a website seldom is). It does work, though, and it is fast.

Example

A semi-generic Java class to submit website forms (specifically dealing with logging into a website) is below, in the hopes that it might be useful. Do not use it for evil.

import java.io.FileInputStream;

import java.util.Enumeration;
import java.util.Hashtable;  
import java.util.Properties; 

import com.meterware.httpunit.GetMethodWebRequest;
import com.meterware.httpunit.SubmitButton;       
import com.meterware.httpunit.WebClient;          
import com.meterware.httpunit.WebConversation;    
import com.meterware.httpunit.WebForm;            
import com.meterware.httpunit.WebLink;            
import com.meterware.httpunit.WebRequest;         
import com.meterware.httpunit.WebResponse;        

public class FormElements extends Properties
{                                           
  private static final String FORM_URL = "form.url";
  private static final String FORM_ACTION = "form.action";

  /** These are properly provided property parameters. */
  private static final String FORM_PARAM = "form.param.";

  /** These are property parameters that are required; must have values. */
  private static final String FORM_REQUIRED = "form.required.";            

  private Hashtable<String, String> fields = new Hashtable<String, String>( 10 );

  private WebConversation webConversation;

  public FormElements()
  {                    
  }                    

  /**
   * Retrieves the HTML page, populates the form data, then sends the
   * information to the server.                                      
   */                                                                
  public void run()                                                  
    throws Exception                                                 
  {                                                                  
    WebResponse response = receive();                                
    WebForm form = getWebForm( response );                           

    populate( form );

    form.submit();
  }               

  protected WebResponse receive()
    throws Exception             
  {                              
    WebConversation webConversation = getWebConversation();
    GetMethodWebRequest request = getGetMethodWebRequest();

    // Fake the User-Agent so the site thinks that encryption is supported.
    //                                                                     
    request.setHeaderField( "User-Agent",
      "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.3) Gecko/20040913" );

    return webConversation.getResponse( request );
  }                                               

  protected void populate( WebForm form )
    throws Exception                     
  {                                      
    // First set all the .param variables.
    //                                    
    setParamVariables( form );            

    // Next, set the required variables.
    //                                  
    setRequiredVariables( form );       
  }                                     

  protected void setParamVariables( WebForm form )
    throws Exception                              
  {                                               
    for( Enumeration e = propertyNames(); e.hasMoreElements(); )
    {                                                           
      String property = (String)(e.nextElement());              

      if( property.startsWith( FORM_PARAM ) )
      {                                      
        String fieldName = getProperty( property );
        String propertyName = property.substring( FORM_PARAM.length() );
        String fieldValue = getField( propertyName );                   

        // Skip blank fields (most likely, this is a blank last name, which
        // means the form wants a full name).                              
        //                                                                 
        if( "".equals( fieldName ) )                                       
          continue;                                                        

        // If this is the first name, and the last name parameter is blank,
        // then append the last name field to the first name field.        
        //                                                                 
        if( "first_name".equals( propertyName ) &&                         
            "".equals( getProperty( FORM_PARAM + "last_name" ) ) )         
          fieldValue += " " + getField( "last_name" );                     

        showSet( fieldName, fieldValue );
        form.setParameter( fieldName, fieldValue );
      }                                            
    }                                              
  }                                                

  protected void setRequiredVariables( WebForm form )
    throws Exception                                 
  {                                                  
    for( Enumeration e = propertyNames(); e.hasMoreElements(); )
    {                                                           
      String property = (String)(e.nextElement());              

      if( property.startsWith( FORM_REQUIRED ) )
      {                                         
        String fieldValue = getProperty( property );
        String fieldName = property.substring( FORM_REQUIRED.length() );

        // If the field starts with a ~, then copy the field.
        //                                                   
        if( fieldValue.startsWith( "~" ) )                   
        {                                                    
          String copyProp = fieldValue.substring( 1, fieldValue.length() );
          copyProp = getProperty( copyProp );                              

          // Since the parameters have been copied into the form, we can   
          // eke out the duplicate values.                                 
          //                                                               
          fieldValue = form.getParameterValue( copyProp );                 
        }                                                                  

        showSet( fieldName, fieldValue );
        form.setParameter( fieldName, fieldValue );
      }                                            
    }                                              
  }                                                

  private void showSet( String fieldName, String fieldValue )
  {                                                          
    System.out.print( "<p class='setting'>" );               
    System.out.print( fieldName );                           
    System.out.print( " = " );                               
    System.out.print( fieldValue );                          
    System.out.println( "</p>" );                            
  }                                                          

  private WebForm getWebForm( WebResponse response )
    throws Exception                                
  {                                                 
    WebForm[] forms = response.getForms();          
    String action = getProperty( FORM_ACTION );     

    // Not supposed to break out of a for-loop, but it makes the code easy ...
    //                                                                        
    for( int i = forms.length - 1; i >= 0; i-- )                              
      if( forms[ i ].getAction().equalsIgnoreCase( action ) )                 
        return forms[ i ];                                                    

    // Sadly, no form was found.
    //
    throw new Exception( "No form with action '" + action + "' was found." );
  }                             

  private GetMethodWebRequest getGetMethodWebRequest()
  {
    return new GetMethodWebRequest( getProperty( FORM_URL ) );
  }

  private WebConversation getWebConversation()
  {
    if( this.webConversation == null )
      this.webConversation = new WebConversation();

    return this.webConversation;
  }

  public void setField( String field, String value )
  {
    Hashtable<String, String> fields = getFields();
    fields.put( field, value );
  }

  private String getField( String field )
  {
    Hashtable<String, String> fields = getFields();
    String result = fields.get( field );

    return result == null ? "" : result;
  }

  private Hashtable<String, String> getFields()
  {
    return this.fields;
  }

  public static void main( String args[] )
    throws Exception
  {
    FormElements formElements = new FormElements();

    formElements.setField( "first_name", args[1] );
    formElements.setField( "last_name", args[2] );
    formElements.setField( "email", args[3] );
    formElements.setField( "comments",  args[4] );

    FileInputStream fis = new FileInputStream( args[0] );
    formElements.load( fis );
    fis.close();

    formElements.run();
  }
}

An example properties file would look like:

$ cat com.mellon.properties

form.url=https://www.mellon.com/contact/index.cfm
form.action=index.cfm
form.param.first_name=name
form.param.last_name=
form.param.email=emailhome
form.param.comments=comments

# Submit Button
#form.submit=submit

# Required Fields
#
form.required.to=zzwebmaster
form.required.phone=555-555-1212
form.required.besttime=5 to 7pm

Run it with a command similar to the following (substitute the path to HTTPUnit and the FormElements class for $CLASSPATH):

java -cp $CLASSPATH FormElements com.mellon.properties "John" "Doe" "[email protected]" "To whom it may concern  ..."

Legality

Another answer mentioned that this might violate the site's terms of use. That is extremely good advice: check into it first, before you spend any time looking into a technical solution.

Dave Jarvis answered Sep 20 '22 13:09


Most of the time, you can just send a simple HTTP POST request.

I'd suggest you try playing around with Fiddler to understand how the web works.

Nearly all the programming languages and frameworks out there have methods for sending raw requests.
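
For instance, in Java 11 or later the standard java.net.http package can build such a request. This is only a rough sketch: the URL and the field names are made up for illustration, and the request is built but not sent, so you can inspect it first.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class FormPost {
  // Encodes name/value pairs as application/x-www-form-urlencoded.
  static String encode( Map<String, String> fields ) {
    StringJoiner body = new StringJoiner( "&" );

    for( Map.Entry<String, String> f : fields.entrySet() )
      body.add( URLEncoder.encode( f.getKey(), StandardCharsets.UTF_8 )
        + "=" + URLEncoder.encode( f.getValue(), StandardCharsets.UTF_8 ) );

    return body.toString();
  }

  // Builds (but does not send) a POST request carrying the encoded fields.
  static HttpRequest build( String url, Map<String, String> fields ) {
    return HttpRequest.newBuilder( URI.create( url ) )
      .header( "Content-Type", "application/x-www-form-urlencoded" )
      .POST( HttpRequest.BodyPublishers.ofString( encode( fields ) ) )
      .build();
  }

  public static void main( String[] args ) {
    // Hypothetical form fields and URL, for illustration only.
    Map<String, String> fields = new LinkedHashMap<>();
    fields.put( "name", "John Doe" );
    fields.put( "comments", "Hello & goodbye" );

    HttpRequest request = build( "https://example.com/contact", fields );
    System.out.println( request.method() + " " + request.uri() );
    System.out.println( encode( fields ) );
  }
}
```

To actually send it, pass the request to HttpClient.newHttpClient().send( request, HttpResponse.BodyHandlers.ofString() ) and read body() from the response.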

And you can always program against the Internet Explorer ActiveX control. I believe many programming languages support it.

chakrit answered Sep 18 '22 13:09


I believe this would put you in violation of the terms of use (consult a lawyer about that: programmers are not good at giving legal advice!), but, technically, you could search for foobar by just visiting the URL http://www.google.com/search?q=foobar and, as you say, scraping the resulting HTML. You'll probably also need to fake the User-Agent HTTP header and maybe some others.
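
Purely as a technical illustration, here is a minimal Java 11+ sketch of such a request. The User-Agent value is a made-up placeholder, the class and method names are mine, and the request is only built, not sent.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

class SearchRequest {
  // Hypothetical browser-like value; substitute whatever suits your case.
  static final String USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64)";

  // Builds a GET request for the search URL with the terms URL-encoded.
  static HttpRequest build( String terms ) {
    String q = URLEncoder.encode( terms, StandardCharsets.UTF_8 );

    return HttpRequest.newBuilder(
        URI.create( "http://www.google.com/search?q=" + q ) )
      .header( "User-Agent", USER_AGENT )
      .GET()
      .build();
  }

  public static void main( String[] args ) {
    HttpRequest request = build( "foobar" );
    System.out.println( request.uri() );
    System.out.println( request.headers().firstValue( "User-Agent" ).orElse( "" ) );
  }
}
```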

Maybe there are search engines whose terms of use do not forbid this; you and your lawyer might be well advised to look around to see if this is indeed the case.

Alex Martelli answered Sep 19 '22 13:09


Well, here's the HTML from the Google page:

<form action="/search" name=f><table cellpadding=0 cellspacing=0><tr valign=top>
<td width=25%>&nbsp;</td><td align=center nowrap>
<input name=hl type=hidden value=en>
<input type=hidden name=ie value="ISO-8859-1">
<input autocomplete="off" maxlength=2048 name=q size=55 title="Google Search" value="">
<br>
<input name=btnG type=submit value="Google Search">
<input name=btnI type=submit value="I'm Feeling Lucky">
</td><td nowrap width=25% align=left>
<font size=-2>&nbsp;&nbsp;<a href=/advanced_search?hl=en>
Advanced Search</a><br>&nbsp;&nbsp;
<a href=/preferences?hl=en>Preferences</a><br>&nbsp;&nbsp;
<a href=/language_tools?hl=en>Language Tools</a></font></td></tr></table>
</form>

If you know how to make an HTTP request from your favorite programming language, just give it a try and see what you get back. Try this for instance:

http://www.google.com/search?hl=en&q=Stack+Overflow
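
The parameter names in that URL (hl, q) are simply the name attributes of the form's input elements. If you want to discover them for some other form, a rough Java sketch like the following can list them. The class name and the regex are my own assumptions and only suit simple markup like the form above; a real HTML parser would be more robust.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FormFields {
  // Captures the name attribute of each <input> tag, quoted or unquoted.
  private static final Pattern INPUT_NAME = Pattern.compile(
    "<input\\b[^>]*?\\bname=[\"']?([^\\s\"'>]+)", Pattern.CASE_INSENSITIVE );

  // Returns the input names in the order they appear in the page.
  static List<String> inputNames( String html ) {
    List<String> names = new ArrayList<>();
    Matcher m = INPUT_NAME.matcher( html );

    while( m.find() )
      names.add( m.group( 1 ) );

    return names;
  }

  public static void main( String[] args ) {
    // A shortened copy of the form shown above.
    String html = "<form action=\"/search\" name=f>"
      + "<input name=hl type=hidden value=en>"
      + "<input type=hidden name=ie value=\"ISO-8859-1\">"
      + "<input maxlength=2048 name=q size=55 value=\"\">"
      + "</form>";

    System.out.println( inputNames( html ) );
  }
}
```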

Mark Lutton answered Sep 20 '22 13:09