Data Scraping across <div>'s

Question

I am trying to extract information from a repeating set of rows, each containing many embedded <div>'s. I am writing a scraper to pull various elements from the page, but I can't find a way to reach the tag whose class holds the information for each row, and I can't isolate the sections I need to extract. For reference, here is a sample of one row:

<div id="dTeamEventResults" class="col-md-12 team-event-results"><div>
    <div class="row team-event-result team-result">
        <div class="col-md-12 main-info">
            <div class="row">
                <div class="col-md-7 event-name">
                    <dl>
                        <dt>Team Number:</dt> 
                        <dd><a href="/team-event-search/team?program=JFLL&amp;year=2017&amp;number=11733" class="result-name">11733</a></dd>
                        <dt>Team:</dt> 
                        <dd> Aqua Duckies</dd>
                        <dt>Program:</dt> 
                        <dd>FIRST LEGO League Jr.</dd>
                    </dl>
                </div>

The script I have started to build looks like the following:

from urllib2 import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'https://www.firstinspires.org/team-event-search#type=teams&sort=name&keyword=NJ&programs=FLLJR,FLL,FTC,FRC&year=2017'

uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html, "html.parser")

rows = page_soup.findAll("div", {"class":"row team-event-result team-result"})

Whenever I run len(rows), the result is always 0. I seem to have hit a wall. Thanks for your help!

SIM · Accepted Answer

The content of this page is generated dynamically by JavaScript, so a plain HTTP request never sees it. To capture it, you need a browser automation tool such as Selenium. Here is a script that fetches the content you want. Give this a shot:

from bs4 import BeautifulSoup
from selenium import webdriver

# Selenium drives a real browser, so the page's scripts run before the HTML is read.
driver = webdriver.Chrome()
driver.get('https://www.firstinspires.org/team-event-search#type=teams&sort=name&keyword=NJ&programs=FLLJR,FLL,FTC,FRC&year=2017')
soup = BeautifulSoup(driver.page_source, "lxml")

# Each result row keeps its details inside a .main-info block.
for items in soup.select('.main-info'):
    # Pair every <dt> label with its <dd> value, collapsing extra whitespace in the value.
    docs = ' '.join([' '.join([item.text, ' '.join(val.text.split())]) for item, val in zip(items.select(".event-name dt"), items.select(".event-name dd"))])
    # Collapse the whitespace inside each address block.
    location = ' '.join([' '.join(item.text.split()) for item in items.select(".event-location-type address")])
    print("Event_Info: {}\nEvent_Location: {}\n".format(docs, location))
driver.quit()

The results look something like:

Event_Info: Team Number: 11733 Team: Aqua Duckies Program: FIRST LEGO League Jr.
Event_Location: Sparta, NJ 07871 USA

Event_Info: Team Number: 4281 Team: Bulldogs Program: FIRST Robotics Competition
Event_Location: Somerset, NJ 08873 USA
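
As a side note on why the original request found no rows: everything after the `#` in the URL is a fragment, which stays on the client and is not sent to the server. The search filters (`keyword=NJ`, `year=2017`, and so on) sit entirely in that fragment, so a plain HTTP client only asks for `/team-event-search`, and the rows are presumably filled in afterwards by the page's scripts. A minimal sketch using Python 3's standard `urllib.parse` (the question itself uses Python 2's `urllib2`; the names `parts` and `filters` are illustrative only):

```python
from urllib.parse import urlsplit, parse_qs

my_url = ('https://www.firstinspires.org/team-event-search'
          '#type=teams&sort=name&keyword=NJ&programs=FLLJR,FLL,FTC,FRC&year=2017')

parts = urlsplit(my_url)

# The path and query are what an HTTP client requests from the server.
print(parts.path)         # /team-event-search
print(repr(parts.query))  # '' (no query string at all)

# The fragment holds the search filters; it is only visible to code
# running in the browser, not to the server.
filters = parse_qs(parts.fragment)
print(filters['keyword'])  # ['NJ']
print(filters['year'])     # ['2017']
```

This is only a way to inspect the URL. It does not fetch the results, which is why the answer above reads `driver.page_source` from a real browser instead.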

Tags: python, html, beautifulsoup, web-scraping

John Falconi
