How to scrape a _private_ google group?

I'd like to scrape the discussion list of a private Google group. It's a multi-page list and I might have to do this again later, so scripting sounds like the way to go.

Since this is a private group, I need to log in to my Google account first. Unfortunately I can't manage to log in using wget or Ruby's Net::HTTP. Surprisingly, Google Groups is not accessible through the ClientLogin interface, so the code samples written for it are of no use here.

My Ruby script is embedded at the end of the post. The response to the authentication request is a 200 OK, but there are no cookies in the response headers, and the body contains the message "Your browser's cookie functionality is turned off. Please turn it on."

I got the same output with wget. See the bash script at the end of this message.

I don't know how to work around this. Am I missing something? Any ideas?

Thanks in advance.

John

Here is the ruby script:

# a ruby script
require 'net/https'
require 'uri'

http = Net::HTTP.new('www.google.com', 443)
http.use_ssl = true
path = '/accounts/ServiceLoginAuth'

email = '[email protected]'
password = 'topsecret'

# form inputs from the login page, URL-encoded so special characters survive
data = URI.encode_www_form('Email' => email, 'Passwd' => password,
                           'dsh' => '7379491738180116079', 'GALX' => 'irvvmW0Z-zI')
headers = { 'Content-Type' => 'application/x-www-form-urlencoded',
            'User-Agent' => "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.2 (KHTML, like Gecko) Chrome/6.0" }

# Post the request, then print the status line and every response header
# (http.post returns only the response object in current Ruby versions)
resp = http.post(path, data, headers)
puts "#{resp.code} #{resp.message}"
resp.each_header { |h, v| puts "#{h}=#{v}" }

#warning: peer certificate won't be verified in this SSL session

Here is the bash script:

# A bash script for wget
# Options go in an array so the quoted values survive word splitting
# (with a plain string, the single quotes would be passed to wget literally)
OPTS=(
  --keep-session-cookies --save-cookies cookies.tmp
  --no-check-certificate
  --post-data='[email protected]&Passwd=topsecret&dsh=-8408553335275857936&GALX=irvvmW0Z-zI'
  --user-agent='Mozilla'
)
URL="https://www.google.com/accounts/ServiceLoginAuth"
echo wget "${OPTS[@]}" "$URL"
wget "${OPTS[@]}" "$URL"
# Quote the URL so the shell does not treat "?" as a glob character
wget --load-cookies="cookies.tmp" 'http://groups.google.com/group/mygroup/topics?tsc=2'

asked Apr 02 '10 09:04 by John


2 Answers

Have you tried Mechanize for Ruby?
The Mechanize library automates interaction with websites; you could log in to Google and browse your private Google group, saving what you need.

Here is an example where Mechanize is used for Gmail scraping.
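
A library like Mechanize tends to work where a single POST does not because it keeps a cookie jar between requests. A plausible but unconfirmed explanation for the "cookie functionality is turned off" message is that the login form expects the cookies set while loading the login page to come back with the POST. Here is a minimal sketch of that round trip done by hand with Net::HTTP. The `/accounts/ServiceLogin` page path and the helper names `cookie_header`, `hidden_input` and `login` are my own assumptions, not something from the question, and Google's login flow may well have changed since.

```ruby
require 'net/http'
require 'uri'

# Turn raw Set-Cookie header values into one Cookie request header.
# Only the leading name=value pair of each cookie is kept; attributes such
# as Path or Expires are dropped. (Hypothetical helper, not part of Net::HTTP.)
def cookie_header(set_cookie_values)
  Array(set_cookie_values).map { |c| c.split(';', 2).first.strip }.join('; ')
end

# Pull the value of a hidden form input out of the login page HTML.
# This crude regex assumes name="..." comes before value="..." in the tag;
# a real HTML parser would be more robust.
def hidden_input(html, name)
  html[/name="#{Regexp.escape(name)}"[^>]*value="([^"]*)"/, 1]
end

# Load the login page first, then post the form together with the cookies
# and hidden fields that page handed out.
# ASSUMPTION: the login page lives at /accounts/ServiceLogin; the POST path
# and the field names are the ones used in the question.
def login(email, password)
  http = Net::HTTP.new('www.google.com', 443)
  http.use_ssl = true

  page = http.get('/accounts/ServiceLogin')
  cookies = cookie_header(page.get_fields('set-cookie'))

  form = {
    'Email'  => email,
    'Passwd' => password,
    'dsh'    => hidden_input(page.body, 'dsh'),
    'GALX'   => hidden_input(page.body, 'GALX')
  }

  http.post('/accounts/ServiceLoginAuth',
            URI.encode_www_form(form),
            'Content-Type' => 'application/x-www-form-urlencoded',
            'Cookie'       => cookies)
end
```

Calling `login` with your credentials returns the response to the POST, so you can inspect its headers for session cookies the same way the script in the question does. Mechanize does all of this bookkeeping for you, which is why I would still reach for it first.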

answered Sep 17 '22 14:09 by systempuntoout


I did this previously by logging in manually with Firefox and then using Chickenfoot to automate the browsing and scraping.

answered Sep 20 '22 14:09 by hoju