
How to crawl a website/extract data into a database with Python?

I'd like to build a web app to help other students at my university create their schedules. To do that, I need to crawl the master schedules (one huge HTML page), along with the link to a detailed description for each course, into a database, preferably in Python. I also need to log in to access the data.

  • How would that work?
  • What tools/libraries can/should I use?
  • Are there good tutorials on that?
  • How do I best deal with binary data (e.g. pretty PDFs)?
  • Are there already good solutions for that?
McEnroe asked Dec 01 '11 01:12



2 Answers

I liked using BeautifulSoup for extracting HTML data.

It's as easy as this:

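from urllib.request import urlopen

from bs4 import BeautifulSoup  # BeautifulSoup 4: pip install beautifulsoup4

# Download the RSS feed and parse it with the built-in parser
with urlopen("http://pragprog.com/podcasts/feed.rss") as response:
    soup = BeautifulSoup(response.read(), "html.parser")

# Each <item> element is one entry in the feed
items = soup.find_all("item")

# Collect the URL from each item's <enclosure> tag, skipping items without one
urls = [item.enclosure["url"] for item in items if item.enclosure]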
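Since the question asks about getting the extracted data into a database, here is a minimal sketch of how a list like `urls` could be stored with the standard-library `sqlite3` module. The table name `enclosures`, the helper `save_urls`, and the example URLs are made up for illustration; a real schema would depend on what you scrape.

```python
import sqlite3


def save_urls(connection, urls):
    """Store each URL once in a (hypothetical) 'enclosures' table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS enclosures (url TEXT PRIMARY KEY)"
    )
    # INSERT OR IGNORE skips URLs that are already in the table
    connection.executemany(
        "INSERT OR IGNORE INTO enclosures (url) VALUES (?)",
        [(url,) for url in urls],
    )
    connection.commit()


# An in-memory database keeps the example self-contained;
# pass a filename such as "scrape.db" to keep the data on disk.
connection = sqlite3.connect(":memory:")

# Placeholder data standing in for the scraped list
sample_urls = [
    "http://example.com/a.mp3",
    "http://example.com/b.mp3",
    "http://example.com/a.mp3",  # duplicate, ignored on insert
]
save_urls(connection, sample_urls)

rows = connection.execute("SELECT url FROM enclosures ORDER BY url").fetchall()
print(rows)
```

Using the URL as the primary key means re-running the scraper does not create duplicate rows.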
Alexey Grigorev answered Oct 13 '22 02:10


  • requests for downloading the pages.
    • Here's an example of how to login to a website and download pages: https://stackoverflow.com/a/8316989/311220
  • lxml for scraping the data.
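A rough sketch of how the two libraries above might fit together: `requests.Session` keeps the login cookies between requests, and `lxml.html` pulls the course links out of the page. The URLs, the form field names (`username`, `password`), and the table id `schedule` are assumptions for illustration only; the real site will use its own.

```python
import requests
from lxml import html

# Hypothetical URLs; replace them with the real login and schedule pages.
LOGIN_URL = "https://example.edu/login"
SCHEDULE_URL = "https://example.edu/schedule"


def fetch_schedule_page(username, password):
    """Log in, then download the schedule page within the same session."""
    with requests.Session() as session:
        # The session stores the cookies set by the login response.
        # The form field names here are assumed, not taken from a real site.
        login = session.post(
            LOGIN_URL,
            data={"username": username, "password": password},
            timeout=30,
        )
        login.raise_for_status()
        page = session.get(SCHEDULE_URL, timeout=30)
        page.raise_for_status()
        return page.text


def parse_courses(page_source):
    """Return (title, detail_url) pairs from an assumed table layout."""
    tree = html.fromstring(page_source)
    courses = []
    for link in tree.xpath('//table[@id="schedule"]//tr/td/a'):
        courses.append((link.text_content().strip(), link.get("href")))
    return courses


# Inline sample so the parsing step can be tried without network access
SAMPLE_PAGE = """
<html><body><table id="schedule">
  <tr><td><a href="/courses/cs101">Intro to Programming</a></td></tr>
  <tr><td><a href="/courses/ma201">Linear Algebra</a></td></tr>
</table></body></html>
"""

courses = parse_courses(SAMPLE_PAGE)
print(courses)
```

For the real site you would call `parse_courses(fetch_schedule_page(user, password))`. For binary files such as PDFs, the response's `content` attribute gives the raw bytes, which can be written to a file opened in `"wb"` mode.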

If you want a powerful scraping framework, there's Scrapy. It is well documented, though it may be overkill depending on your task.

Acorn answered Oct 13 '22 01:10