I can fetch an HTML page with urllib and parse it with BeautifulSoup, but it looks like I have to write the page to a file before BeautifulSoup can read it.
import urllib
sock = urllib.urlopen("http://SOMEWHERE")
htmlSource = sock.read()
sock.close()
# --> write htmlSource to a file
Is there a way to call BeautifulSoup on the urllib result without generating a file?
Steps involved in web scraping:
1. Find the URL of the webpage that you want to scrape.
2. Inspect the page and select the particular elements you need.
3. Write the code to get the content of the selected elements.
4. Store the data in the required format.
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(htmlSource)
No file writing is needed: just pass in the HTML string. You can also pass the object returned from urlopen directly:
f = urllib.urlopen("http://SOMEWHERE")
soup = BeautifulSoup(f)
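The snippets above use the Python 2 urllib module and the BeautifulSoup 3 import. For reference, here is a minimal sketch of the same idea for Python 3, assuming the beautifulsoup4 package (imported as bs4) is installed. The helper names soup_from_url and soup_from_string and the sample markup are made up for illustration, and io.BytesIO stands in for the urlopen response so the demonstration runs without a network connection.

```python
import io
from urllib.request import urlopen

from bs4 import BeautifulSoup  # assumption: beautifulsoup4 is installed


def soup_from_url(url):
    """Fetch a page and parse it in memory; nothing is written to disk."""
    with urlopen(url) as response:
        # BeautifulSoup accepts an open file-like object as well as a string.
        return BeautifulSoup(response, "html.parser")


def soup_from_string(html_source):
    """Parse HTML that has already been read into a string."""
    return BeautifulSoup(html_source, "html.parser")


# Offline demonstration with made-up markup.
sample = b"<html><head><title>Example</title></head><body><a href='/a'>A</a></body></html>"

# Option 1: pass the HTML as a string.
soup_a = soup_from_string(sample.decode("utf-8"))

# Option 2: pass a file-like object (BytesIO plays the role of the urlopen response).
soup_b = BeautifulSoup(io.BytesIO(sample), "html.parser")

print(soup_a.title.string)
print(soup_b.a["href"])
```

Naming the parser explicitly ("html.parser" ships with Python) avoids the warning bs4 prints when it has to guess one. Calling soup_from_url("http://SOMEWHERE") would follow the same path as the second option, just with a real response object.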