
what does read() in urlopen('http.....').read() do? [urllib]

Hi, I'm reading "Web Scraping with Python" (2015). I saw the following two ways of passing an opened URL to BeautifulSoup, with and without .read(). See bs1 and bs2:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://web.stanford.edu/~zlotnick/TextAsData/Web_Scraping_with_Beautiful_Soup.html')
bs1 = BeautifulSoup(html.read(), 'html.parser')

html = urlopen('http://web.stanford.edu/~zlotnick/TextAsData/Web_Scraping_with_Beautiful_Soup.html')
bs2 = BeautifulSoup(html, 'html.parser')

bs1 == bs2 # True


print(bs1.prettify()[0:100])
print(bs2.prettify()[0:100]) # prints same thing

So is .read() redundant? Thanks

Code on p7 of Web Scraping with Python (uses .read()):

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/page1.html")
bsObj = BeautifulSoup(html.read())

Code on p15 (without .read())

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bsObj = BeautifulSoup(html)
YJZ asked Mar 08 '16 09:03



1 Answer

Quoting BS docs:

To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open filehandle:

When you call .read(), you're using the "string" interface. When you don't, you're using the "filehandle" interface.

Effectively, both work the same way (although BS4 may read a file-like object lazily). In your case, calling .read() loads the whole response body into memory as a single object (bytes, for urlopen) before parsing starts, which may consume more memory than necessary.
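To make the two interfaces concrete, here is a minimal sketch using only the standard library. The helper load_markup is hypothetical and only illustrates how a constructor can accept either form; it is not BeautifulSoup's actual code. io.BytesIO stands in for the response object that urlopen() returns, so no network access is needed.

```python
from io import BytesIO


def load_markup(markup):
    """Hypothetical helper: accept raw markup or a file-like object."""
    if hasattr(markup, "read"):
        # "Filehandle" interface: the callee does the reading.
        return markup.read()
    # "String" interface: the caller has already read the content.
    return markup


# BytesIO is an assumed stand-in for the object urlopen() returns.
body = b"<html><body><h1>Hello</h1></body></html>"

via_string = load_markup(BytesIO(body).read())  # caller calls .read()
via_handle = load_markup(BytesIO(body))         # callee calls .read()
print(via_string == via_handle)

# A stream can only be consumed once: a second read() returns nothing.
response = BytesIO(body)
first = response.read()
second = response.read()
print(len(first), len(second))
```

The second half is probably the most practical difference: once .read() has been called on a response, the same object cannot be passed to a parser again, which is why the question's snippet calls urlopen twice.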

Łukasz Rogalski answered Sep 21 '22 18:09
