
No output for parsing google news content

With the code below, I want to get the titles and URLs of Google News search results.

It worked in the past, but now it produces no output and I don't know why.

Did Google change its CSS structure, or is something else wrong?

Thanks

    public static void main(String[] args) throws UnsupportedEncodingException, IOException {
        String google = "http://www.google.com/search?q=";
        String search = "stackoverflow";
        String charset = "UTF-8";
        String news = "&tbm=nws";
        String userAgent = "ExampleBot 1.0 (+http://example.com/bot)"; // Change this to your company's name and bot homepage!

        Elements links = Jsoup.connect(google + URLEncoder.encode(search, charset) + news).userAgent(userAgent).get().select(".g>.r>.a");

        for (Element link : links) {
            String title = link.text();
            String url = link.absUrl("href"); // Google returns URLs in format "http://www.google.com/url?q=<url>&sa=U&ei=<someKey>".
            url = URLDecoder.decode(url.substring(url.indexOf('=') + 1, url.indexOf('&')), "UTF-8");

            if (!url.startsWith("http")) {
                continue; // Ads/news/etc.
            }
            System.out.println("Title: " + title);
            System.out.println("URL: " + url);
        }
    }
evabb asked Jan 11 '17 04:01


1 Answer

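If the question is "how do I get the code working again?", it is hard for anyone to say what the old page looked like unless they saved a copy.

I split your chained call into separate steps and changed the selector, and it worked for me. In your selector, ".a" matches elements with the class "a", whereas "a" matches the anchor elements themselves. Also, in the current page source the h3 with class "r" is no longer a direct child of the div with class "g", so ".g>.r" matches nothing.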

    String string = google + URLEncoder.encode(search , charset) + news;
    Document document = Jsoup.connect(string).userAgent(userAgent).get();
    Elements links = document.select( ".r>a");

The current page source looks like this:

       <div class="g">
        <table>
         <tbody>
          <tr>
           <td valign="top" style="width:516px"><h3 class="r"><a href="/url?q=https://www.bleepingcomputer.com/news/security/marlboro-ransomware-defeated-in-one-day/&amp;sa=U&amp;ved=0ahUKEwis77iq7cDRAhXI7IMKHUAoDs0QqQIIFCgAMAE&amp;usg=AFQjCNFFx-sJdU814auBfquRYSsct2c8WA">Marlboro Ransomware Defeated in One Day</a></h3>

Results:

    Title: Marlboro Ransomware Defeated in One Day
    URL: https://www.bleepingcomputer.com/news/security/marlboro-ransomware-defeated-in-one-day/
    Title: Stack Overflow puts a new spin on resumes for developers
    URL: https://techcrunch.com/2016/10/11/stack-overflow-puts-a-new-spin-on-resumes-for-developers/
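The loop in the question unwraps Google's redirect links by cutting between the first '=' and the first '&', which throws a StringIndexOutOfBoundsException for any link without an '&'. A slightly more defensive variant might look like the minimal sketch below. The class and method names (GoogleRedirect, extractTarget) are made up for illustration, and the URLDecoder.decode(String, Charset) overload assumes Java 10 or later.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of jsoup or of the original code.
final class GoogleRedirect {

    /**
     * Returns the decoded value of the "q" parameter of a redirect link such as
     * "http://www.google.com/url?q=<url>&sa=U", or null if there is no such parameter.
     */
    static String extractTarget(String href) {
        int queryStart = href.indexOf('?');
        if (queryStart < 0) {
            return null; // no query string at all
        }
        // Walk the key=value pairs instead of relying on the position of '&'.
        for (String pair : href.substring(queryStart + 1).split("&")) {
            if (pair.startsWith("q=")) {
                return URLDecoder.decode(pair.substring(2), StandardCharsets.UTF_8);
            }
        }
        return null; // query string present, but no q parameter
    }

    public static void main(String[] args) {
        String href = "http://www.google.com/url?q=https://www.bleepingcomputer.com/news/"
                + "security/marlboro-ransomware-defeated-in-one-day/&sa=U";
        System.out.println(extractTarget(href));
    }
}
```

Inside the loop you would call it on link.absUrl("href") and skip the link when the result is null or does not start with "http".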

Edited - Time range

These URL parameters look awful, but to restrict the results to a date range, add the suffix

    &tbs=cdr%3A1%2Ccd_min%3A5%2F30%2F2016%2Ccd_max%3A6%2F30%2F2016

Here %3A is a URL-encoded ":", %2C a "," and %2F a "/", so the decoded value is cdr:1,cd_min:5/30/2016,cd_max:6/30/2016.

The part "min%3A5%2F30%2F2016" contains your minimum date, 5/30/2016: min%3A + (month of year) + %2F + (day of month) + %2F + year. Likewise, "max%3A6%2F30%2F2016" contains your maximum date, 6/30/2016: max%3A + (month of year) + %2F + (day of month) + %2F + year.

Here's the full URL searching for Mindy Kaling between 05/30/2016 and 06/30/2016:

    https://www.google.com/search?tbm=nws&q=mindy%20kaling&tbs=cdr%3A1%2Ccd_min%3A5%2F30%2F2016%2Ccd_max%3A6%2F30%2F2016
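If you would rather not assemble the percent-encoded suffix by hand, one option is to build the plain value and let URLEncoder encode it. This is only a sketch: the class and method names (NewsDateRange, tbsSuffix, searchUrl) are invented for illustration, and the URLEncoder.encode(String, Charset) overload assumes Java 10 or later. Note that URLEncoder writes the space in the query as "+" rather than "%20"; both forms are accepted in a query string.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for building the date-range suffix described above.
final class NewsDateRange {

    // Month/day/year without zero padding, matching the 5/30/2016 form.
    private static final DateTimeFormatter MDY = DateTimeFormatter.ofPattern("M/d/yyyy");

    /** Builds "&tbs=..." for the given minimum and maximum dates. */
    static String tbsSuffix(LocalDate min, LocalDate max) {
        String raw = "cdr:1,cd_min:" + min.format(MDY) + ",cd_max:" + max.format(MDY);
        // Encoding turns ':' into %3A, ',' into %2C and '/' into %2F.
        return "&tbs=" + URLEncoder.encode(raw, StandardCharsets.UTF_8);
    }

    /** Builds the full news search URL for a query and date range. */
    static String searchUrl(String query, LocalDate min, LocalDate max) {
        return "https://www.google.com/search?tbm=nws&q="
                + URLEncoder.encode(query, StandardCharsets.UTF_8)
                + tbsSuffix(min, max);
    }

    public static void main(String[] args) {
        System.out.println(searchUrl("mindy kaling",
                LocalDate.of(2016, 5, 30), LocalDate.of(2016, 6, 30)));
    }
}
```

The resulting string can be passed to Jsoup.connect(...) in place of the hand-built one.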

ProgrammersBlock answered Sep 21 '22 03:09