Web scraping both HTML text and image link with Python BeautifulSoup

I'm new to Python and trying to scrape the table from this URL using BeautifulSoup: http://www.espn.com/college-sports/basketball/recruiting/databaseresults?firstname=&lastname=&class=2007&starsfilter=GT&stars=0&ratingfilter=GT&rating=&positionrank=&sportid=4294967265&collegeid=&conference=&visitmonth=&visityear=&statuscommit=Commitments&statusuncommit=Uncommited&honor=&region=&state=&height=&weight=

So far, I've figured out how to pull the table data for each player's row, as well as the link to the school logo in each row. However, I'm having trouble combining the two. I want to pull the table data for each player (player_data in the code below) together with the corresponding school logo image link (logo_links), and write both into one row per player in a saved CSV.

Below is what I have so far. Thanks in advance for the help.

#! python3
# downloadRecruits.py - Downloads espn college basketball recruiting database info

import csv

import bs4
import requests

# Starting url (class of 2007)
url = 'http://www.espn.com/college-sports/basketball/recruiting/databaseresults?firstname=&lastname=&class=2007&starsfilter=GT&stars=0&ratingfilter=GT&rating=&positionrank=&sportid=4294967265&collegeid=&conference=&visitmonth=&visityear=&statuscommit=Commitments&statusuncommit=Uncommited&honor=&region=&state=&height=&weight='


# Download the page
print('Downloading page %s...' % url)
res = requests.get(url)
res.raise_for_status()

# Creating bs object
soup = bs4.BeautifulSoup(res.text, "html.parser")

# Get the data
data_rows = soup.find_all('tr')[1:]  # skip the header row

# One list of cell texts per table row
player_data = [[td.get_text() for td in row.find_all('td')] for row in data_rows]

logo_links = [a['href'] for div in soup.find_all("div", attrs={"class": "school-logo"}) for a in div.find_all('a')]


# Saving only player_data
with open('recruits2.csv', 'w', newline='') as f_output:
   csv_output = csv.writer(f_output)
   csv_output.writerows(player_data)
NateRattner asked Nov 01 '25 01:11


1 Answer

I would do something like this.
Reason 1: You only have to search the HTML for your content once instead of twice.
Reason 2: As a consequence of reason 1, you don't have to loop over the rows a second time.

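# data_rows is the list of <tr> tags from the question's code
player_data = []
for tr in data_rows:
    tdata = []
    for td in tr.find_all('td'):  # iterate over cells only, not whitespace text nodes
        tdata.append(td.getText())

        # If this cell holds the school logo, append its link right after the cell text
        if td.div and 'school-logo' in td.div.get('class', []) and td.div.a:
            tdata.append(td.div.a['href'])

    player_data.append(tdata)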

Small explanation:
I haven't used a list comprehension because of the if block. It checks whether the cell contains a div with the required class name; if it does, the link inside that div is appended to the list of data collected for the current tr tag.
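
For reference, here is a minimal, self-contained sketch of the same single-pass idea that runs without hitting the network. The SAMPLE_HTML markup, the example.com links, and the helper names extract_rows and rows_to_csv are made up for illustration; the real ESPN page may be structured differently. As a small variation on the loop above, the logo link always goes into a final column (empty when a row has no logo), so that every CSV row ends up with the same number of fields.

```python
import csv
import io

import bs4

# Hypothetical markup imitating the structure described in the question;
# the real ESPN page may differ.
SAMPLE_HTML = """
<table>
  <tr><th>Player</th><th>Pos</th><th>School</th></tr>
  <tr>
    <td>Player A</td>
    <td>PG</td>
    <td><div class="school-logo"><a href="http://example.com/logo-a.png"></a></div>School A</td>
  </tr>
  <tr>
    <td>Player B</td>
    <td>SF</td>
    <td>Undecided</td>
  </tr>
</table>
"""


def extract_rows(html):
    """Return one list per table row: the cell texts followed by the logo link."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr")[1:]:  # skip the header row
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        logo_div = tr.find("div", class_="school-logo")
        link = logo_div.find("a") if logo_div else None
        # Always emit a logo column so every CSV row has the same width.
        cells.append(link.get("href", "") if link else "")
        rows.append(cells)
    return rows


def rows_to_csv(rows):
    """Serialize rows to CSV text (an in-memory stand-in for the saved file)."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


rows = extract_rows(SAMPLE_HTML)
print(rows_to_csv(rows))
```

To save to disk instead, the same rows can be passed to csv.writer on a file opened with open('recruits2.csv', 'w', newline=''), as in the question.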

Sachin answered Nov 03 '25 16:11
