Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using JSoup to scrape Google Results

Tags:

java

jsoup

I'm trying to use JSoup to scrape the search results from Google. Currently this is my code.

public class GoogleOptimization {
public static void main (String args[])
{
    Document doc;
    try{
        doc = Jsoup.connect("https://www.google.com/search?as_q=&as_epq=%22Yorkshire+Capital%22+&as_oq=fraud+OR+allegations+OR+scam&as_eq=&as_nlo=&as_nhi=&lr=lang_en&cr=countryCA&as_qdr=all&as_sitesearch=&as_occt=any&safe=images&tbs=&as_filetype=&as_rights=").userAgent("Mozilla").ignoreHttpErrors(true).timeout(0).get();
        Elements links = doc.select("what should i put here?");
        for (Element link : links) {
                System.out.println("\n"+link.text());
    }
    }
    catch (IOException e) {
        e.printStackTrace();
    }
}

}

I'm just trying to get the title of the search results and the snippets below the title. So yea, I just don't know what element to look for in order to scrape these. If anyone has a better method to scrape Google using java I would love to know.

Thanks.

like image 936
user2405920 Avatar asked Jul 17 '13 18:07

user2405920


1 Answers

Here you go.

public class ScanWebSO 
{
public static void main (String args[])
{
    Document doc;
    try{
        doc =        Jsoup.connect("https://www.google.com/search?as_q=&as_epq=%22Yorkshire+Capital%22+&as_oq=fraud+OR+allegations+OR+scam&as_eq=&as_nlo=&as_nhi=&lr=lang_en&cr=countryCA&as_qdr=all&as_sitesearch=&as_occt=any&safe=images&tbs=&as_filetype=&as_rights=").userAgent("Mozilla").ignoreHttpErrors(true).timeout(0).get();
        Elements links = doc.select("li[class=g]");
        for (Element link : links) {
            Elements titles = link.select("h3[class=r]");
            String title = titles.text();

            Elements bodies = link.select("span[class=st]");
            String body = bodies.text();

            System.out.println("Title: "+title);
            System.out.println("Body: "+body+"\n");
        }
    }
    catch (IOException e) {
        e.printStackTrace();
    }
}
}

Also, to do this yourself I would suggest using chrome. You just right click on whatever you want to scrape and go to inspect element. It will take you to the exact spot in the html where that element is located. In this case you first want to find out where the root of all the result listings are. When you find that, you want to specify the element, and preferably an unique attribute to search it by. In this case the root element is

<ol eid="" id="rso">

Below that you will see a bunch of listings that start with

<li class="g"> 

This is what you want to put into your initial elements array, then for each element you will want to find the spot where the title and body are. In this case, I found the title to be under the

<h3 class="r" style="white-space: normal;">

element. So you will search for that element in each listing. The same goes for the body. I found the body to be under so I searched for that using the .text() method and it returned all the text under that element. The key is to ALWAYS try and find the element with an original attribute (using a class name is ideal). If you don't and only search for something like "div" it will search the entire page for ANY element containing div and return that. So you will get WAY more results than you want. I hope this explains it well. Let me know if you have any more questions.

like image 93
Collin Avatar answered Nov 14 '22 23:11

Collin