Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python and BeautifulSoup Opening pages

I am wondering how would I open another page in my list with BeautifulSoup? I have followed this tutorial, but it does not tell us how to open another page on the list. Also how would I open a "a href" that is nested inside of a class?

Here is my code:

# coding: utf-8

import requests
from bs4 import BeautifulSoup

r = requests.get("")
soup = BeautifulSoup(r.content)
soup.find_all("a")

for link in soup.find_all("a"):
    print link.get("href")

    for link in soup.find_all("a"):
        print link.text

    for link in soup.find_all("a"):
        print link.text, link.get("href")

    g_data = soup.find_all("div", {"class":"listing__left-column"})

    for item in g_data:
        print item.contents

    for item in g_data:
        print item.contents[0].text
        print link.get('href')

    for item in g_data:
        print item.contents[0]

I am trying to collect the href's from the titles of each business, and then open them and scrape that data.

like image 641
Brendan Cott Avatar asked Sep 24 '15 05:09

Brendan Cott


People also ask

How do I open a URL in Beautiful Soup?

Steps to be followed:Create a function to get the HTML document from the URL using requests. get() method by passing URL to it. Create a Parse Tree object i.e. soup object using of BeautifulSoup() method, passing it HTML document extracted above and Python built-in HTML parser.

How do I run a Beautiful Soup in Python?

To use beautiful soup, you need to install it: $ pip install beautifulsoup4 . Beautiful Soup also relies on a parser, the default is lxml . You may already have it, but you should check (open IDLE and attempt to import lxml). If not, do: $ pip install lxml or $ apt-get install python-lxml .

Which is better Selenium or Beautiful Soup?

The main difference between Selenium and Beautiful Soup is that Selenium is ideal for complex projects while Beautiful Soup is best for smaller projects. Read on to learn more of the differences! The choice between using these two scraping technologies will likely reflect the scope of the project.


2 Answers

I am still not sure where you are getting the HTML from, but if you are trying to extract all of the href tags, then the following approach should work based on the image you have posted:

import requests
from bs4 import BeautifulSoup

r = requests.get("<add your URL here>")
soup = BeautifulSoup(r.content)

for a_tag in soup.find_all('a', class_='listing-name', href=True):
    print 'href: ', a_tag['href']

By adding href=True to the find_all(), it ensures that only a elements that contain an href attribute are returned therefore removing the need to test for it as an attribute.

Just to warn you, you might find some websites will lock you out after one or two attempts as they are able to detect that you are trying to access a site via a script, rather than as a human. If you feel you are not getting the correct responses, I would recommend printing the HTML you are getting back to ensure it it still as you expect.

If you then want to get the HTML for each of the links, the following could be used:

import requests
from bs4 import BeautifulSoup

# Configure this to be your first request URL
r = requests.get("http://www.mywebsite.com/search/")
soup = BeautifulSoup(r.content)

for a_tag in soup.find_all('a', class_='listing-name', href=True):
    print 'href: ', a_tag['href']

# Configure this to the root of the above website, e.g. 'http://www.mywebsite.com'
base_url = "http://www.mywebsite.com"

for a_tag in soup.find_all('a', class_='listing-name', href=True):
    print '-' * 60      # Add a line of dashes
    print 'href: ', a_tag['href']
    request_href = requests.get(base_url + a_tag['href'])
    print request_href.content

Tested using Python 2.x, for Python 3.x please add parentheses to the print statements.

like image 90
Martin Evans Avatar answered Oct 22 '22 10:10

Martin Evans


  1. I had the same problem and I will like to share my findings because I did try the answer, for some reasons it did not work but after some research I found something interesting.

  2. You might need to find the attributes of the "href" link itself: You will need the exact class which contains the href link in your case, I am thinking="class":"listing__left-column" and equate it to a variable say "all" for example:

from bs4 import BeautifulSoup
all = soup.find_all("div", {"class":"listing__left-column"})
for item in all:
  for link in item.find_all("a"):
    if 'href' in link.attrs:
        a = link.attrs['href']
        print(a)
        print("")

I did this and I was able to get into another link which was embedded in the home page

like image 28
temi Avatar answered Oct 22 '22 09:10

temi