 

Beautiful Soup to parse a URL to get another URL's data

I need to parse a URL to get a list of URLs that link to detail pages. Then from each detail page I need to get all of its details. I need to do it this way because the detail page URLs are not regularly incremented and they change, but the event list page stays the same.

Basically:

example.com/events/
    <a href="http://example.com/events/1">Event 1</a>
    <a href="http://example.com/events/2">Event 2</a>

example.com/events/1
    ...some detail stuff I need

example.com/events/2
    ...some detail stuff I need
tim asked Dec 16 '10 14:12


People also ask

How do you scrape a URL using BeautifulSoup?

Steps to be followed: create a function that gets the HTML document from the URL by passing the URL to requests.get(); then create a parse tree object (a soup object) with BeautifulSoup(), passing it the HTML document fetched above and Python's built-in HTML parser.
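A minimal sketch of those steps, assuming the requests and beautifulsoup4 packages are installed and using the question's listing page as a placeholder URL:

    import requests
    from bs4 import BeautifulSoup

    def get_soup(url):
        # Fetch the HTML document for the URL.
        html = requests.get(url).text
        # Build a parse tree using Python's built-in HTML parser.
        return BeautifulSoup(html, 'html.parser')

    soup = get_soup('http://example.com/events/')
    print(soup.title)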

Which method in BeautifulSoup is used to check all URLs or images?

Method 1: using descendants and find(). First, import the required modules, then fetch the URL with requests and hand the response to BeautifulSoup for parsing. Then, with the help of BeautifulSoup's find() function, locate the <body> tag and its corresponding <ul> tags.
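A hedged sketch of that approach, again assuming requests and beautifulsoup4, and a hypothetical page whose links and images sit inside a <ul> in the <body>:

    import requests
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(requests.get('http://example.com/events/').text, 'html.parser')

    # Use find() to locate the <body> tag and the <ul> inside it.
    body = soup.find('body')
    ul = body.find('ul')

    # Walk the descendants of the list and collect link and image URLs.
    for child in ul.descendants:
        name = getattr(child, 'name', None)  # plain text nodes have no tag name
        if name == 'a':
            print('link:', child.get('href'))
        elif name == 'img':
            print('image:', child.get('src'))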

What is URL parsing?

URL Parsing. The URL parsing functions focus on splitting a URL string into its components, or on combining URL components into a URL string.
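For example, Python's standard urllib.parse module does exactly this splitting and combining (a small illustration, not specific to the question's site):

    from urllib.parse import urlparse, urljoin

    # Splitting a URL string into its components.
    parts = urlparse('http://example.com/events/1?ref=list')
    print(parts.scheme, parts.netloc, parts.path, parts.query)
    # http example.com /events/1 ref=list

    # Combining: resolve a relative link against the listing page.
    print(urljoin('http://example.com/events/', '2'))
    # http://example.com/events/2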

Is BeautifulSoup a parser?

Beautiful Soup is a Python package for parsing HTML and XML documents, including documents with malformed markup such as unclosed tags (the name comes from the term "tag soup"). It creates a parse tree for parsed pages that can be used to extract data from HTML, which is useful for web scraping.
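A tiny illustration of that tolerance for malformed markup (a sketch assuming beautifulsoup4 with the built-in parser):

    from bs4 import BeautifulSoup

    # Neither tag is ever closed, yet a usable tree is still produced.
    soup = BeautifulSoup('<div><p>Event 1', 'html.parser')
    print(soup.find('p').get_text())
    # Event 1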


1 Answer

    import urllib2
    from BeautifulSoup import BeautifulSoup

    # Fetch the page and build the parse tree (Python 2 / BeautifulSoup 3 style).
    page = urllib2.urlopen('http://yahoo.com').read()
    soup = BeautifulSoup(page)
    soup.prettify()  # returns a formatted string; not needed for the loop below

    # Print the target of every anchor that has an href attribute.
    for anchor in soup.findAll('a', href=True):
        print anchor['href']

That will give you the list of URLs. You can then iterate over those URLs and parse the data from each detail page.

  • inner_div = soup.findAll("div", {"id": "y-shade"}) is an example of grabbing one specific element. You can go through the BeautifulSoup tutorials for more; a sketch of the full list-then-detail loop follows below.
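As a rough sketch of the question's two-step flow, assuming a newer requests/beautifulsoup4 setup and hypothetical selectors (the listing lives at example.com/events/ and each detail page keeps its content in a div with class "detail"):

    import requests
    from bs4 import BeautifulSoup

    BASE = 'http://example.com/events/'  # the listing page that never changes

    def get_soup(url):
        return BeautifulSoup(requests.get(url).text, 'html.parser')

    # Step 1: collect the detail-page URLs from the listing page.
    listing = get_soup(BASE)
    event_urls = [a['href'] for a in listing.find_all('a', href=True)]

    # Step 2: visit each detail page and pull out the details.
    for url in event_urls:
        detail = get_soup(url)
        block = detail.find('div', {'class': 'detail'})  # hypothetical selector
        if block is not None:
            print(url, block.get_text(strip=True))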
Tauquir answered Sep 27 '22 17:09