 

Data Scraping across <div>'s

I am trying to extract information from a repeating set of rows containing many embedded <div>'s. I am writing a scraper to get various elements from this page, but I can't find a way to reach the tag whose class holds the information for each row, and I am not able to isolate the sections I need to extract. For reference, here is a sample of one row:

<div id="dTeamEventResults" class="col-md-12 team-event-results"><div>
    <div class="row team-event-result team-result">
        <div class="col-md-12 main-info">
            <div class="row">
                <div class="col-md-7 event-name">
                    <dl>
                        <dt>Team Number:</dt> 
                        <dd><a href="/team-event-search/team?program=JFLL&amp;year=2017&amp;number=11733" class="result-name">11733</a></dd>
                        <dt>Team:</dt> 
                        <dd> Aqua Duckies</dd>
                        <dt>Program:</dt> 
                        <dd>FIRST LEGO League Jr.</dd>
                    </dl>
                </div>

The script I have started to build looks like the following:

from urllib.request import urlopen as uReq  # Python 3; this module was urllib2 on Python 2
from bs4 import BeautifulSoup as soup

my_url = 'https://www.firstinspires.org/team-event-search#type=teams&sort=name&keyword=NJ&programs=FLLJR,FLL,FTC,FRC&year=2017'

uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html, "html.parser")

rows = page_soup.findAll("div", {"class":"row team-event-result team-result"})

Whenever I run len(rows), the result is always 0. I seem to have hit a wall. Thanks for your help!

John Falconi asked Dec 09 '25 12:12


1 Answer

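The content of this page is generated dynamically by JavaScript, so the HTML that urlopen downloads does not contain the result rows, which is why len(rows) is 0. To capture them you need a browser automation tool such as Selenium, which loads the page in a real browser before you parse it. Here is a script that fetches the content you are after. Give this a shot: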

from bs4 import BeautifulSoup
from selenium import webdriver

# Open the page in a real browser so its JavaScript can build the result rows.
driver = webdriver.Chrome()
driver.get('https://www.firstinspires.org/team-event-search#type=teams&sort=name&keyword=NJ&programs=FLLJR,FLL,FTC,FRC&year=2017')
# Parse the rendered DOM rather than the raw HTTP response.
soup = BeautifulSoup(driver.page_source, "lxml")
for items in soup.select('.main-info'):
    # Pair each <dt> label with its <dd> value, collapsing stray whitespace.
    docs = ' '.join([' '.join([item.text, ' '.join(val.text.split())]) for item, val in zip(items.select(".event-name dt"), items.select(".event-name dd"))])
    # Collapse whitespace inside each <address> element and join the pieces.
    location = ' '.join([' '.join(item.text.split()) for item in items.select(".event-location-type address")])
    print("Event_Info: {}\nEvent_Location: {}\n".format(docs, location))
driver.quit()
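
The docs line packs the label/value pairing into a single expression. As a minimal sketch of that logic on its own, the helper pair_fields below is a hypothetical name of mine, and the hard-coded lists are stand-ins for the text of the dt and dd tags in the sample row from the question:

```python
def pair_fields(labels, values):
    """Join each label with its value, collapsing runs of whitespace in the value.

    pair_fields is a hypothetical helper for illustration; it mirrors the
    zip/join/split logic of the docs line in the script above.
    """
    return ' '.join(
        ' '.join([label, ' '.join(value.split())])
        for label, value in zip(labels, values)
    )


# Stand-ins for the dt and dd texts of the sample row in the question.
labels = ["Team Number:", "Team:", "Program:"]
values = ["11733", " Aqua Duckies", "FIRST LEGO League Jr."]

print(pair_fields(labels, values))
```

In the script above, the same pairing runs on every .main-info block of the rendered page.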

The results look something like:

Event_Info: Team Number: 11733 Team: Aqua Duckies Program: FIRST LEGO League Jr.
Event_Location: Sparta, NJ 07871 USA

Event_Info: Team Number: 4281 Team: Bulldogs Program: FIRST Robotics Competition
Event_Location: Somerset, NJ 08873 USA
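
As for why the plain urlopen attempt finds nothing: everything after the # in the URL is a fragment, and a fragment is not part of the HTTP request. The sketch below uses only the standard library's urllib.parse (Python 3) to show what is actually requested. That the page's own JavaScript reads the fragment to load the results is my assumption and not something I have verified.

```python
from urllib.parse import urlsplit

my_url = ('https://www.firstinspires.org/team-event-search'
          '#type=teams&sort=name&keyword=NJ&programs=FLLJR,FLL,FTC,FRC&year=2017')

parts = urlsplit(my_url)

# The request covers scheme, host and path (plus a query string, if any).
# This URL has no '?', so there is no query string at all.
requested = "{}://{}{}".format(parts.scheme, parts.netloc, parts.path)

print("Requested from the server:", requested)
# The search settings live in the fragment, which stays on the client side.
# Assumption: the page's JavaScript reads it and loads the rows afterwards.
print("Kept on the client side:  ", parts.fragment)
```

So a plain HTTP client only ever receives the page for the bare search URL, whatever follows the #.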

SIM answered Dec 12 '25 02:12


