Web scraping both HTML text and image link with Python BeautifulSoup

Question

I'm new to Python and trying to scrape the table from this URL using BeautifulSoup: http://www.espn.com/college-sports/basketball/recruiting/databaseresults?firstname=&lastname=&class=2007&starsfilter=GT&stars=0&ratingfilter=GT&rating=&positionrank=&sportid=4294967265&collegeid=&conference=&visitmonth=&visityear=&statuscommit=Commitments&statusuncommit=Uncommited&honor=&region=&state=&height=&weight=

So far, I've figured out how to pull the table data for each player's row, as well as the link to the school logo in each row. However, I'm having trouble combining the two. I want to pull the table data for each player (player_data in the code below) together with the corresponding school logo image link (logo_links), and write both into one row per player in a saved CSV.

Below is what I have so far. Thanks in advance for the help.

#! python3
# downloadRecruits.py - Downloads espn college basketball recruiting database info

import requests, os, bs4, csv
import pandas as pd

# Starting url (class of 2007)
url = 'http://www.espn.com/college-sports/basketball/recruiting/databaseresults?firstname=&lastname=&class=2007&starsfilter=GT&stars=0&ratingfilter=GT&rating=&positionrank=&sportid=4294967265&collegeid=&conference=&visitmonth=&visityear=&statuscommit=Commitments&statusuncommit=Uncommited&honor=&region=&state=&height=&weight='


# Download the page
print('Downloading page %s...' % url)
res = requests.get(url)
res.raise_for_status()

# Creating bs object
soup = bs4.BeautifulSoup(res.text, "html.parser")

# Get the data
data_rows = soup.findAll('tr')[1:]
type(data_rows)

player_data = [[td.getText() for td in data_rows[i].findAll('td')] for i in range(len(data_rows))]

logo_links = [a['href'] for div in soup.find_all("div", attrs={"class": "school-logo"}) for a in div.find_all('a')]


# Saving only player_data
# newline='' stops the csv module from writing extra blank lines on Windows
with open('recruits2.csv', 'w', newline='') as f_output:
   csv_output = csv.writer(f_output)
   csv_output.writerows(player_data)

Sachin · Accepted Answer

I would do it like this, for two reasons.
Reason 1: You don't have to search the HTML twice for your content.
Reason 2: Following from reason 1, you don't have to loop over the rows a second time.

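player_data = []
for tr in data_rows:
    tdata = []
    # Loop over the cells only; iterating over tr directly can also yield
    # whitespace text nodes, which have no .div attribute
    for td in tr.find_all('td'):
        tdata.append(td.getText())

        # If this cell holds the school logo, keep its link as well
        div = td.div
        if div and 'school-logo' in div.get('class', []) and div.a:
            tdata.append(div.a['href'])

    player_data.append(tdata)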

Small explanation:
I haven't used a list comprehension because of the if block. It checks whether the cell contains a div with the required class name and, if it does, appends the link to the list of data collected for that tr tag.
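
For reference, here is a minimal self-contained sketch of the same single-pass idea, run against a small inline table instead of the live page. The sample HTML, the example.com URL, and the helper names extract_rows and rows_to_csv are made up for illustration, and the real ESPN markup may well differ. Unlike the loop above, it always adds a logo column (empty when a row has no logo) so that every CSV row has the same number of fields.

```python
import csv
import io

import bs4

# Hypothetical markup, loosely modelled on the structure described in the question.
SAMPLE_HTML = """
<table>
  <tr><th>Player</th><th>Pos</th><th>School</th></tr>
  <tr>
    <td>Player A</td><td>PG</td>
    <td><div class="school-logo"><a href="http://example.com/logo-a.png"></a></div>School A</td>
  </tr>
  <tr>
    <td>Player B</td><td>SF</td>
    <td>Undecided</td>
  </tr>
</table>
"""


def extract_rows(html):
    """Return one list per table row: the cell texts, then the logo link (or '')."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr")[1:]:  # skip the header row
        cells = tr.find_all("td")
        texts = [td.get_text(strip=True) for td in cells]
        logo = ""
        for td in cells:
            div = td.find("div", class_="school-logo")
            if div and div.a and div.a.has_attr("href"):
                logo = div.a["href"]
                break
        rows.append(texts + [logo])
    return rows


def rows_to_csv(rows):
    """Serialise rows to CSV text; for a real file, open it with newline=''."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


rows = extract_rows(SAMPLE_HTML)
print(rows_to_csv(rows))
```

To use this on the real page, you would pass res.text to extract_rows and write the rows to a file opened with newline='' rather than to an in-memory buffer.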

Tags: python, beautifulsoup, web-scraping

NateRattner
