
How to read an entire web page into a variable

I am trying to read an entire web page and assign it to a variable, but am having trouble doing that. The variable seems to only be able to hold the first 512 or so lines of the page source.

I tried using readlines() to just print all lines of the source to the screen, and that gave me the source in its entirety, but I need to be able to parse it with regex, so I need to store it in a variable somehow. Help?

 data = urllib2.urlopen(url)
 print data

Only gives me about 1/3 of the source.

 data = urllib2.urlopen(url)
 for lines in data.readlines():
     print lines

This gives me the entire source.

Like I said, I need to be able to parse the string with regex, but the part I need isn't in the first 1/3 I'm able to store in my variable.

Rentafence asked Jun 06 '12 04:06



2 Answers

You are probably looking for Beautiful Soup: http://www.crummy.com/software/BeautifulSoup/ It's an open-source HTML and XML parsing library for Python. Best of luck!

vaebnkehn answered Oct 05 '22 22:10



urlopen() returns a file-like object, so you can call its read() method to read the entire response into a single string. That will give you the entire source. Something like

import urllib2

data = urllib2.urlopen(url).read()  # read the whole response body into one string
print data

should give you the entire webpage.
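For anyone on Python 3, where urllib2 no longer exists, the same idea might look roughly like the sketch below. It assumes Python 3's urllib.request; the function name fetch_page and the UTF-8 fallback are illustrative choices, not something from the original question.

```python
import urllib.request


def fetch_page(url, fallback_encoding="utf-8"):
    """Return the full body of `url` as one decoded string.

    `fetch_page` and `fallback_encoding` are hypothetical names used
    only for this sketch.
    """
    with urllib.request.urlopen(url) as response:
        # read() with no argument consumes the body until end of stream,
        # so the result is the whole page rather than a partial chunk.
        raw = response.read()
        # Use the charset declared in the Content-Type header if there is
        # one; otherwise fall back to the assumed default.
        charset = response.headers.get_content_charset() or fallback_encoding
    # errors="replace" keeps going if the page contains invalid bytes.
    return raw.decode(charset, errors="replace")
```

Calling fetch_page(url) with your own URL returns the page as a single string, which you can then hand to whatever parser you choose.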

From there, don't parse HTML with regex (there is a well-worn post to this effect); use a dedicated HTML parser instead. Alternatively, clean up the HTML and convert it to XHTML (for instance with HTML Tidy), and then use an XML parsing library such as ElementTree from the standard library. Which approach is best depends on your application.
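To make the "dedicated HTML parser" suggestion concrete, here is a minimal sketch that pulls link targets out of a page using html.parser from the Python 3 standard library. This is one possible parser, swapped in for illustration; it is not Beautiful Soup, HTML Tidy, or ElementTree. The names LinkCollector and extract_links are made up for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser.

    Hypothetical helper class, written only for this sketch.
    """

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html):
    """Return all link targets found in the HTML string `html`."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

Given the page source as a string, extract_links(html) returns the href values in document order. Unlike a regex, the parser copes with attribute order, quoting style, and upper-case tags.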

Adam Mihalcin answered Oct 05 '22 23:10
