I am looking to extract parts of the data rendered on a web page. I can already pull the entire page and save it to a text file (raw) using the command below.
curl http://webpage -o "raw.txt"
I am wondering whether there are alternatives, and what advantages they would offer.
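I would use a combination of requests and BeautifulSoup.
from bs4 import BeautifulSoup
import requests
# Reuse one session so cookies and connections persist across requests
session = requests.Session()
req = session.get('http://stackoverflow.com/questions/10807081/script-to-extract-data-from-wbpage')
# Name the parser explicitly; omitting it makes bs4 guess one and emit a warning
doc = BeautifulSoup(req.content, 'html.parser')
# Print every <a> element whose class is "gp-share"
print(doc.find_all('a', {"class": "gp-share"}))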
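If installing third-party packages is not an option, the same kind of extraction can be approximated with the standard library's html.parser module. The following is only a minimal sketch, not the method from the answer above: the names LinkCollector and extract_links and the sample HTML are hypothetical, and in practice the input would be the contents of a file such as raw.txt.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag that carries a given CSS class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # The class attribute may hold several space-separated names
        classes = (attr_map.get("class") or "").split()
        if self.wanted_class in classes:
            self.links.append(attr_map.get("href"))


def extract_links(html, wanted_class="gp-share"):
    """Return the hrefs of matching links found in an HTML string."""
    parser = LinkCollector(wanted_class)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample input standing in for the saved page
sample = '<p><a class="gp-share big" href="/share">Share</a> <a href="/other">Other</a></p>'
print(extract_links(sample))  # prints ['/share']
```

BeautifulSoup is usually more forgiving with malformed markup, so this sketch mainly suits simple, well-formed pages.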
cURL is a good start. A better command line would be:
curl -A "Mozilla/5.0" -L -k -b /tmp/c -c /tmp/c -s http://url.tld
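because it sets a user agent (-A), follows redirects (-L), skips SSL certificate verification (-k), reads and stores cookies in /tmp/c (-b and -c), and suppresses the progress output (-s). Note that -k disables certificate checks, so use it only when you trust the host.
See man curl for the full list of options.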
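For comparison, here is a rough Python counterpart to that curl command, built only on the standard library. It is a sketch under assumptions rather than an exact equivalent: the function names make_curl_like_opener and fetch are hypothetical, and the defaults simply mirror the cookie file and user agent from the command above.

```python
import http.cookiejar
import ssl
import urllib.request


def make_curl_like_opener(cookie_file="/tmp/c", user_agent="Mozilla/5.0"):
    """Build an opener that roughly mirrors curl -A ... -L -k -b FILE -c FILE."""
    # Netscape-format cookie file, the same format curl reads and writes
    jar = http.cookiejar.MozillaCookieJar(cookie_file)
    try:
        jar.load(ignore_discard=True)
    except OSError:
        pass  # No usable cookie file yet: start with an empty jar

    # Like -k: skip certificate verification (insecure, trusted hosts only)
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE

    # urllib follows redirects by default, which covers -L
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar),
        urllib.request.HTTPSHandler(context=context),
    )
    opener.addheaders = [("User-Agent", user_agent)]  # like -A
    return opener, jar


def fetch(url, cookie_file="/tmp/c"):
    """Download a page and write the cookies back, as -c would."""
    opener, jar = make_curl_like_opener(cookie_file)
    with opener.open(url) as response:
        body = response.read()
    jar.save(ignore_discard=True)
    return body
```

Calling fetch with a page URL returns the raw bytes of the response, which can then be passed to an HTML parser.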