I need to parse a URL to get a list of URLs that link to detail pages. Then, from each of those pages, I need to get all the details. I need to do it this way because the detail page URLs are not regularly incremented and change over time, but the event list page stays the same.
Basically:
    example.com/events/
        <a href="http://example.com/events/1">Event 1</a>
        <a href="http://example.com/events/2">Event 2</a>

    example.com/events/1
        ...some detail stuff I need

    example.com/events/2
        ...some detail stuff I need
Steps to be followed: Create a function that gets the HTML document from the URL using the requests.get() method, passing the URL to it. Then create a parse tree object (a soup object) with BeautifulSoup(), passing it the HTML document extracted above and Python's built-in HTML parser.
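The two steps above could be sketched roughly as follows. This is a minimal sketch, assuming the third-party packages requests and beautifulsoup4 are installed. The function names get_html and make_soup and the 10-second timeout are illustrative choices, not part of the original steps.

```python
import requests
from bs4 import BeautifulSoup


def get_html(url):
    """Fetch a page and return its HTML as text."""
    # The timeout is an assumed value; raise_for_status() turns
    # 4xx/5xx responses into exceptions instead of returning error pages.
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def make_soup(html):
    """Build a parse tree using Python's built-in HTML parser."""
    return BeautifulSoup(html, "html.parser")
```

A call such as make_soup(get_html("http://example.com/events/")) would then return the soup object for the list page.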
Method 1: Using descendants and find()
First, import the required modules, then provide the URL and create its requests object, which will be parsed by the BeautifulSoup object. Then, with the help of BeautifulSoup's find() function, find the <body> tag and its corresponding <ul> tags.
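As a rough illustration of this method, the sketch below uses find() to reach the first <ul> inside <body> and then walks its .descendants. To stay self-contained it parses a hypothetical SAMPLE_HTML string instead of a downloaded page; the function name links_in_first_list is also just an illustrative choice.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
<ul>
  <li><a href="http://example.com/events/1">Event 1</a></li>
  <li><a href="http://example.com/events/2">Event 2</a></li>
</ul>
</body></html>
"""


def links_in_first_list(html):
    """Return the href of every <a> nested under the first <ul> in <body>."""
    soup = BeautifulSoup(html, "html.parser")
    body = soup.find("body")
    ul = body.find("ul") if body else None
    if ul is None:
        return []
    # .descendants yields every nested node, including text nodes,
    # so keep only <a> tags that carry an href attribute.
    return [
        node["href"]
        for node in ul.descendants
        if isinstance(node, Tag) and node.name == "a" and node.has_attr("href")
    ]
```

Calling links_in_first_list(SAMPLE_HTML) gives the two event URLs in document order. On a real page the <ul> of interest may not be the first one, so the find() call would need adjusting.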
URL Parsing. The URL parsing functions in Python's urllib.parse module focus on splitting a URL string into its components, or on combining URL components into a URL string.
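A short sketch of both directions, using Python's standard urllib.parse module. The query string and fragment in the sample URL are made up for illustration.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

# Split a URL string into its components.
parts = urlsplit("http://example.com/events/1?ref=list#top")

# Combine components back into a URL string, dropping query and fragment.
clean_url = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

# Resolve a relative link found on the list page against the page's URL.
detail_url = urljoin("http://example.com/events/", "2")
```

urljoin is the piece that matters most when scraping, since href values on a list page are often relative rather than absolute.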
Beautiful Soup is a Python package for parsing HTML and XML documents, including documents with malformed markup such as non-closed tags (hence the name, after "tag soup"). It creates a parse tree for parsed pages that can be used to extract data from HTML, which is useful for web scraping.
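    from urllib.request import urlopen

    from bs4 import BeautifulSoup

    # Download the page and build a parse tree from it
    page = urlopen('http://yahoo.com').read()
    soup = BeautifulSoup(page, 'html.parser')

    # Print the target of every link that has an href attribute
    for anchor in soup.find_all('a', href=True):
        print(anchor['href'])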
It will give you the list of URLs. Now you can iterate over those URLs and parse the data from each page.
    inner_div = soup.find_all("div", {"id": "y-shade"})
This is an example. You can go through the BeautifulSoup tutorials.
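Putting the two steps together, the whole list-then-detail flow might look something like the sketch below. It assumes requests and beautifulsoup4 are installed, and that detail pages live under the list page's URL, as in the example.com/events/ layout from the question. All function names are hypothetical, and parse_detail only grabs the page title and visible text as a placeholder, since the elements that hold the real details depend on the site.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page; the 10-second timeout is an assumed value."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def extract_event_links(list_html, list_url):
    """Return absolute detail-page URLs found on the list page, in order."""
    soup = BeautifulSoup(list_html, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        url = urljoin(list_url, anchor["href"])  # resolve relative hrefs
        # Assumption: detail pages sit below the list URL, e.g. /events/1.
        if url.startswith(list_url) and url != list_url and url not in links:
            links.append(url)
    return links


def parse_detail(detail_html):
    """Placeholder extraction: page title plus all visible text."""
    soup = BeautifulSoup(detail_html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else None
    return {"title": title, "text": soup.get_text(" ", strip=True)}


def scrape_events(list_url, fetch=fetch_html):
    """Fetch the list page, then fetch and parse each linked detail page."""
    details = {}
    for url in extract_event_links(fetch(list_url), list_url):
        details[url] = parse_detail(fetch(url))
    return details
```

Calling scrape_events("http://example.com/events/") would return a dict keyed by detail URL. The fetch parameter is there so the download step can be swapped out, for instance with a function that reads saved pages while testing.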