Hi, I'm reading "Web Scraping with Python" (2015). I saw the following two ways of opening a URL, with and without .read(). See bs1 and bs2:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://web.stanford.edu/~zlotnick/TextAsData/Web_Scraping_with_Beautiful_Soup.html')
bs1 = BeautifulSoup(html.read(), 'html.parser')
html = urlopen('http://web.stanford.edu/~zlotnick/TextAsData/Web_Scraping_with_Beautiful_Soup.html')
bs2 = BeautifulSoup(html, 'html.parser')
print(bs1 == bs2)  # True
print(bs1.prettify()[0:100])
print(bs2.prettify()[0:100]) # prints same thing
So is .read() redundant? Thanks.
Code on p. 7 of Web Scraping with Python (uses .read()):
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/page1.html")
bsObj = BeautifulSoup(html.read())
Code on p. 15 (without .read()):
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bsObj = BeautifulSoup(html)
Quoting BS docs:
To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open filehandle:
When you call .read() yourself, you're using the "string" interface. When you don't, you're using the "filehandle" interface.
Effectively both work the same way (although BS4 may read a file-like object lazily). In your case the whole content is read into a string object first, which may consume more memory than necessary.
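To compare the two interfaces without hitting the network, here is a minimal sketch. It assumes an in-memory io.BytesIO object is a fair stand-in for the response returned by urlopen (both expose .read()); the RAW_HTML content and the two helper function names are made up for illustration and do not come from the book or the BS docs.

```python
import io

from bs4 import BeautifulSoup

# Hypothetical page body used only for this demo.
RAW_HTML = b"<html><head><title>Demo</title></head><body><h1>Hello</h1></body></html>"


def parse_with_read(fileobj):
    """'String' interface: read the content ourselves, then pass it in."""
    return BeautifulSoup(fileobj.read(), "html.parser")


def parse_without_read(fileobj):
    """'Filehandle' interface: pass the open file-like object in directly."""
    return BeautifulSoup(fileobj, "html.parser")


# A fresh BytesIO per call, because a stream can only be read once
# (the same reason the question calls urlopen twice).
bs1 = parse_with_read(io.BytesIO(RAW_HTML))
bs2 = parse_without_read(io.BytesIO(RAW_HTML))

print(str(bs1) == str(bs2))   # compare the serialized documents
print(bs1.h1.get_text())      # Hello
print(bs2.h1.get_text())      # Hello
```

Either form should give the same parsed document, so for pages of this size the choice is mostly a matter of style.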