How to read an entire web page into a variable

Question

I am trying to read an entire web page and assign it to a variable, but am having trouble doing that. The variable seems to only be able to hold the first 512 or so lines of the page source.

I tried using readlines() to just print all lines of the source to the screen, and that gave me the source in its entirety, but I need to be able to parse it with regex, so I need to store it in a variable somehow. Help?

 data = urllib2.urlopen(url)
 print data

Only gives me about 1/3 of the source.

 data = urllib2.urlopen(url)
 for line in data.readlines():
     print line

This gives me the entire source.

Like I said, I need to be able to parse the string with regex, but the part I need isn't in the first 1/3 I'm able to store in my variable.

vaebnkehn · Accepted Answer

You are probably looking for Beautiful Soup: http://www.crummy.com/software/BeautifulSoup/ It's an open-source HTML parsing library for Python. Best of luck!

Adam Mihalcin · Answer

urlopen returns a file-like object, so you should be able to call its read() method to get the entire page source as a single string. Something like

import urllib2
data = urllib2.urlopen(url).read()  # the entire page source as one string
print data

should give you the entire webpage.
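For anyone on Python 3, where urllib2's urlopen now lives in urllib.request, a minimal sketch of the same idea might look like the following. The helper name read_page and the UTF-8 fallback are my own assumptions for illustration, not part of the question.

```python
import urllib.request


def read_page(url, default_charset="utf-8"):
    """Return the full body of `url` as a single text string.

    `read_page` and `default_charset` are hypothetical names chosen
    for this sketch.
    """
    with urllib.request.urlopen(url) as response:
        # read() with no argument returns the whole body as bytes
        raw = response.read()
        # Use the charset declared in the Content-Type header, if any
        charset = response.headers.get_content_charset() or default_charset
    # Undecodable bytes are replaced rather than raising an error
    return raw.decode(charset, errors="replace")
```

Calling page = read_page(url) then leaves the whole source in one variable, ready to be handed to whatever parser you choose.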

From there, don't parse HTML with regex (there is a well-worn post to this effect); use a dedicated HTML parser instead. Alternatively, clean up the HTML and convert it to XHTML (for instance with HTML Tidy), and then use an XML parsing library such as the standard ElementTree. Which approach is best depends on your application.
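As a rough illustration of the dedicated-parser route, here is a sketch using html.parser from the Python 3 standard library. I picked it only because it needs no installation; Beautiful Soup offers a friendlier API for the same job. The names LinkCollector and extract_links are hypothetical, and collecting link targets is just a stand-in for whatever you actually need to pull out of the page.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html_text):
    """Return all link targets found in `html_text`, in document order."""
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links
```

You would pass it the string you read from the page, for example links = extract_links(page_source). Unlike a regex, the parser copes with attribute order, quoting style, and upper-case tags without extra work.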

Tags:

python

urllib2

web-scraping

Rentafence
