Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Getting HTML with Pycurl

Tags:

python

pycurl

I've been trying to retrieve a page of HTML using pycurl, so I can then parse it for relevant information using str.split and some for loops. I know Pycurl retrieves the HTML, since it prints it to the terminal, however, if I try to do something like

html = str(c.perform())  

The variable will just hold a string which says "None".

How can I use pycurl to get the html, or redirect whatever it sends to the console so it can be used as a string as described above?

Thanks a lot to anyone who has any suggestions!

like image 945
Sinthet Avatar asked Jul 02 '11 00:07

Sinthet


1 Answers

The perform() method executes the html fetch and writes the result to a function you specify. You need to provide a buffer to put the html into and a write function. Usually, this can be accomplished using a StringIO object as follows:

import pycurl
import StringIO

c = pycurl.Curl()
c.setopt(pycurl.URL, "http://www.google.com/")

b = StringIO.StringIO()
c.setopt(pycurl.WRITEFUNCTION, b.write)
c.setopt(pycurl.FOLLOWLOCATION, 1)
c.setopt(pycurl.MAXREDIRS, 5)
c.perform()
html = b.getvalue()

You could also use a file or tempfile or anything else that can store data.

like image 53
MakeSomething Avatar answered Sep 28 '22 01:09

MakeSomething