Script to extract data from a web page

Tags:

python

I am looking to extract some parts of data rendered on a web page. I am able to pull the entire data from the page and save it in a text file (RAW) using the code below.

curl http://webpage -o "raw.txt"

I am wondering whether there are alternatives, and what advantages they would offer.

Selase asked May 29 '12 21:05


2 Answers

I would use a combination of requests and BeautifulSoup.

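from bs4 import BeautifulSoup
import requests

# Reuse one session so cookies and connections persist across requests
session = requests.Session()
req = session.get('http://stackoverflow.com/questions/10807081/script-to-extract-data-from-wbpage')
# Name the parser explicitly; bs4 warns when it has to guess one
doc = BeautifulSoup(req.content, 'html.parser')
# find_all is the current bs4 name for the older findAll
print(doc.find_all('a', {"class": "gp-share"}))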
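If installing third-party packages is not an option, the same idea (parse the HTML, then pick out anchors by class) can be roughly approximated with the standard library's html.parser. This is only a sketch: the ClassLinkCollector class, the extract_links helper and the sample HTML are hypothetical names made up for illustration, and it is far less forgiving of messy markup than BeautifulSoup.

```python
from html.parser import HTMLParser


class ClassLinkCollector(HTMLParser):
    """Collect href values of <a> tags that carry a given CSS class (hypothetical helper)."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # An attribute without a value is reported as None, hence the "or"
        classes = (attr_map.get("class") or "").split()
        if self.css_class in classes and attr_map.get("href"):
            self.links.append(attr_map["href"])


def extract_links(html, css_class):
    """Return the hrefs of all anchors in `html` whose class list contains `css_class`."""
    parser = ClassLinkCollector(css_class)
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up sample standing in for a downloaded page such as raw.txt
SAMPLE_HTML = """
<html><body>
  <a class="gp-share" href="/share/1">share</a>
  <a class="other" href="/skip">skip</a>
  <a class="btn gp-share" href="/share/2">share again</a>
</body></html>
"""

if __name__ == "__main__":
    print(extract_links(SAMPLE_HTML, "gp-share"))
```

The same function could be fed the contents of the raw.txt file saved by curl instead of the sample string.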
sberry answered Oct 30 '22 04:10


cURL is a good start. A better command line would be:

curl -A "Mozilla/5.0" -L -k -b /tmp/c -c /tmp/c -s http://url.tld

because it sets a user agent (-A), follows redirects (-L), skips SSL certificate verification (-k), reads and writes cookies (-b, -c), and silences progress output (-s).

See man curl
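Since the question is tagged python, here is a rough standard-library counterpart to that curl command. Treat it as a sketch under assumptions: make_opener and fetch are hypothetical helper names, the cookie file is assumed to be in the Netscape format that curl writes, and certificate verification stays on unless verify_tls=False is passed (the equivalent of -k).

```python
import http.cookiejar
import os
import ssl
import urllib.request


def make_opener(cookie_file="/tmp/c", user_agent="Mozilla/5.0", verify_tls=True):
    """Build a urllib opener that roughly mirrors the curl flags (hypothetical helper)."""
    # -b / -c: keep cookies in a Netscape-format file
    jar = http.cookiejar.MozillaCookieJar(cookie_file)
    if os.path.exists(cookie_file):
        jar.load(ignore_discard=True)

    context = ssl.create_default_context()
    if not verify_tls:
        # -k: skip certificate checks; hostname checking must be disabled first
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE

    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar),
        urllib.request.HTTPSHandler(context=context),
    )
    # -A: replace the default Python-urllib user agent
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


def fetch(url, opener, jar):
    """Download `url`, then write any cookies back to the jar's file."""
    # -L: urllib follows redirects by default
    with opener.open(url, timeout=30) as response:
        body = response.read()
    jar.save(ignore_discard=True)
    return body
```

A call would look like `opener, jar = make_opener()` followed by `fetch("http://url.tld", opener, jar)`, which returns the page body as bytes.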

Gilles Quenot answered Oct 30 '22 06:10