Python Web Crawlers and "getting" html source code

My brother wanted me to write a web crawler in Python (self-taught). I know C++, Java, and a bit of HTML. I'm using version 2.7 and reading the Python library documentation, but I have a few problems. First, httplib.HTTPConnection and the request concept are new to me, and I don't understand whether a request downloads an HTML script, something like a cookie, or an instance. If you use both of those, do you get the source of a web page? Second, what terms would I need to know to modify the page and return the modified page?

Just for background, I need to download a page and replace any img with ones I have

It would also be nice if you could tell me your opinion of 2.7 versus 3.1.

asked Aug 20 '10 17:08

Dan

3 Answers

~~Use Python 2.7, it has more third-party libraries at the moment.~~ (Edit: see below).

I recommend using the stdlib module urllib2; it lets you fetch web resources comfortably. Example:

import urllib2

response = urllib2.urlopen("http://google.de")
page_source = response.read()

For parsing the code, have a look at BeautifulSoup.
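For the img replacement itself, here is a minimal sketch that uses only the standard library's html.parser instead of BeautifulSoup, assuming Python 3. The names ImgSrcRewriter and replace_img_sources are made up for this illustration, and the sketch points every img at one replacement URL; treat it as a starting point, not a complete HTML rewriter.

```python
from html import escape
from html.parser import HTMLParser


class ImgSrcRewriter(HTMLParser):
    """Re-emit the parsed HTML, changing only the src of <img> tags.

    Hypothetical helper for illustration; not part of any library.
    """

    def __init__(self, new_src):
        # convert_charrefs=False keeps entities such as &amp; as written
        super().__init__(convert_charrefs=False)
        self.new_src = new_src
        self.parts = []

    def _emit_start(self, tag, attrs):
        if tag != "img":
            # Copy every other start tag exactly as it appeared in the input
            self.parts.append(self.get_starttag_text())
            return
        rendered = []
        for name, value in attrs:
            if name == "src":
                value = self.new_src
            if value is None:
                rendered.append(" " + name)  # bare attribute, no value
            else:
                rendered.append(' %s="%s"' % (name, escape(value, quote=True)))
        # An <img> without a src attribute is left without one
        self.parts.append("<img" + "".join(rendered) + ">")

    def handle_starttag(self, tag, attrs):
        self._emit_start(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> arrive here
        self._emit_start(tag, attrs)

    def handle_endtag(self, tag):
        self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        self.parts.append("&#%s;" % name)

    def handle_comment(self, data):
        self.parts.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.parts.append("<!%s>" % decl)


def replace_img_sources(html, new_src):
    """Return html with the src of every <img> set to new_src."""
    parser = ImgSrcRewriter(new_src)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


page = '<p>Tom &amp; Jerry</p><img src="old.png" alt="cat"><br/>'
print(replace_img_sources(page, "mine.png"))
```

BeautifulSoup copes better with badly broken markup; the stdlib parser is used here only so the sketch runs without installing anything.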

BTW: what exactly do you want to do:

Just for background, I need to download a page and replace any img with ones I have

Edit: It's 2014 now; most of the important libraries have been ported, and you should definitely use Python 3 if you can. python-requests is a very nice high-level library that is easier to use than urllib2.

answered Oct 02 '22 07:10

leoluk

An example with Python 3 and the requests library, as mentioned by @leoluk:

pip install requests

Script req.py:

import requests

url = 'http://localhost'

# in case you need a session cookie
cd = {'sessionid': '123..'}

r = requests.get(url, cookies=cd)
# or without a session: r = requests.get(url)

# r.text is the decoded HTML; r.content holds the raw bytes
print(r.text)

Now run it, and it will print the HTML source of localhost:

python3 req.py

answered Oct 02 '22 08:10

Timo

If you are using Python 3.x, you don't need to install any libraries; this is built directly into the standard library. The old urllib2 module was split into urllib.request and urllib.error:

from urllib import request

response = request.urlopen("https://www.google.com")
# set the correct charset below
page_source = response.read().decode('utf-8')
print(page_source)
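
Instead of hard-coding 'utf-8', the charset declared in the response's Content-Type header can be used when the server sends one. A minimal sketch, assuming Python 3; fetch_html and its fallback_charset parameter are names invented for this illustration:

```python
from urllib import request


def fetch_html(url, fallback_charset="utf-8", timeout=10):
    """Download url and decode the body using the charset the server declares.

    Hypothetical helper: falls back to fallback_charset when the
    Content-Type header names no charset.
    """
    with request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or fallback_charset
        # errors="replace" avoids an exception if the declared charset is wrong
        return response.read().decode(charset, errors="replace")


# A data: URL keeps this demo offline; an http(s) URL is passed the same way.
print(fetch_html("data:text/html;charset=utf-8,%3Cp%3Ehi%3C%2Fp%3E"))
```

Pages that declare their encoding only in a meta tag inside the HTML are not covered by this sketch.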

answered Oct 02 '22 08:10

Caner

Tags:

python

get

web-crawler
