
Scraping greatschools.org using BeautifulSoup returns an empty list

I've been learning to scrape the greatschools.org website with BeautifulSoup, and I've hit a dead end despite trying different solutions from here and elsewhere. Using Chrome's "Inspect" feature, I can see that the page contains table tags, but find_all('tr'), find_all('table'), and find_all('tbody') each return an empty list. What am I missing?

Here's the code I'm using:

import requests
from bs4 import BeautifulSoup

url = "https://www.greatschools.org/pennsylvania/bethlehem/schools/?tableView=Overview&view=table"
page_response = requests.get(url)
content = BeautifulSoup(page_response.text,"html.parser")

table=content.find_all('table')
table

The output is: []

Thanks in advance for your help.

ph03nix asked Mar 04 '23 19:03


2 Answers

Since the page appears to be rendered dynamically, you can use Selenium to load it. You can still parse the result with BeautifulSoup if you like. For data in table tags, I prefer to read the HTML with pandas. You'd still have to do a little work splitting the text in the first column into separate columns, which shouldn't be too hard.

Let me know if this works for you.

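from io import StringIO

import pandas as pd
from selenium import webdriver

url = "https://www.greatschools.org/pennsylvania/bethlehem/schools/?tableView=Overview&view=table"

# Raw string so the backslashes in the Windows path are not treated as escapes.
# Newer Selenium 4 releases take the path through
# selenium.webdriver.chrome.service.Service instead of a positional argument.
driver = webdriver.Chrome(r'C:\chromedriver_win32\chromedriver.exe')
driver.get(url)

# HTML as the browser sees it, including content added by JavaScript
html = driver.page_source

# read_html returns a list of DataFrames, one per table found;
# newer pandas deprecates passing a literal HTML string, hence StringIO
table = pd.read_html(StringIO(html))
df = table[0]

driver.quit()  # quit() also shuts down the chromedriver process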

Output

print(table[0])
                                               School                       ...                                                              District
0   9/10Above averageSouthern Lehigh Intermediate ...                       ...                                       Southern Lehigh School District
1   8/10Above averageHanover El School3890 Jackson...                       ...                                        Bethlehem Area School District
2   8/10Above averageLehigh Valley Charter High Sc...                       ...                        Lehigh Valley Charter High School For The Arts
3   6/10AverageCalypso El School1021 Calypso Ave, ...                       ...                                        Bethlehem Area School District
4   6/10AverageMiller Heights El School3605 Allen ...                       ...                                        Bethlehem Area School District
5   6/10AverageAsa Packer El School1650 Kenwood Dr...                       ...                                        Bethlehem Area School District
6   6/10AverageLehigh Valley Academy Regional Cs15...                       ...                                     Lehigh Valley Academy Regional Cs
7   5/10AverageNortheast Middle School1170 Fernwoo...                       ...                                        Bethlehem Area School District
8   5/10AverageNitschmann Middle School1002 West U...                       ...                                        Bethlehem Area School District
9   5/10AverageThomas Jefferson El School404 East ...                       ...                                        Bethlehem Area School District
10  4/10Below averageJames Buchanan El School1621 ...                       ...                                        Bethlehem Area School District
11  4/10Below averageLincoln El School1260 Gresham...                       ...                                        Bethlehem Area School District
12  4/10Below averageGovernor Wolf El School1920 B...                       ...                                        Bethlehem Area School District
13  4/10Below averageSpring Garden El School901 No...                       ...                                        Bethlehem Area School District
14  4/10Below averageClearview El School2121 Abing...                       ...                                        Bethlehem Area School District
15  4/10Below averageLiberty High School1115 Linde...                       ...                                        Bethlehem Area School District
16  4/10Below averageEast Hills Middle School2005 ...                       ...                                        Bethlehem Area School District
17  4/10Below averageFreedom High School3149 Chest...                       ...                                        Bethlehem Area School District
18  3/10Below averageMarvine El School1425 Livings...                       ...                                        Bethlehem Area School District
19  3/10Below averageWilliam Penn El School1002 Ma...                       ...                                        Bethlehem Area School District
20  3/10Below averageLehigh Valley Dual Language C...                       ...                            Lehigh Valley Dual Language Charter School
21  2/10Below averageBroughal Middle School114 Wes...                       ...                                        Bethlehem Area School District
22  2/10Below averageDonegan El School1210 East 4t...                       ...                                        Bethlehem Area School District
23  2/10Below averageFountain Hill El School1330 C...                       ...                                        Bethlehem Area School District
24  Currently unratedSt. Anne School375 Hickory St...                       ...                                                                   NaN

[25 rows x 7 columns]
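
The splitting work on the first column mentioned above could start with something like the sketch below. It assumes every cell begins either with a rating such as "8/10" followed by one of the labels visible in the output ("Above average", "Average", "Below average") or with "Currently unrated"; the function name split_school_cell is made up for illustration, and the sample cells are the truncated strings exactly as printed, so the real cells will be longer.

```python
import re

# Labels taken from the output above; other pages may use different ones.
SCHOOL_CELL = re.compile(
    r"^(?:(?P<rating>\d+)/10(?P<label>Above average|Average|Below average)"
    r"|(?P<unrated>Currently unrated))"
    r"(?P<rest>.*)$"
)


def split_school_cell(cell):
    """Split a fused 'School' cell into (rating, label, rest)."""
    m = SCHOOL_CELL.match(cell)
    if m is None:
        return None, None, cell  # unknown layout: leave the text untouched
    if m.group("unrated"):
        return None, m.group("unrated"), m.group("rest")
    return int(m.group("rating")), m.group("label"), m.group("rest")


# Truncated cells copied from the printed DataFrame.
cells = [
    "8/10Above averageHanover El School3890 Jackson...",
    "6/10AverageCalypso El School1021 Calypso Ave, ...",
    "Currently unratedSt. Anne School375 Hickory St...",
]
for cell in cells:
    print(split_school_cell(cell))
```

On the DataFrame itself, the same function could be applied with df["School"].map(split_school_cell). Separating the school name from the address in the remaining text is left out here, since it depends on the full, untruncated cell contents.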

If you still want to use BeautifulSoup, say because you also need some of the links or other tags inside the table and the table text alone isn't enough, you can continue as you normally would with bs4 once you have the page source in page_response.

from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://www.greatschools.org/pennsylvania/bethlehem/schools/?tableView=Overview&view=table"

# Raw string so the backslashes in the Windows path are not treated as escapes.
driver = webdriver.Chrome(r'C:\chromedriver_win32\chromedriver.exe')
driver.get(url)

# HTML as the browser sees it, including the JavaScript-rendered table
page_response = driver.page_source
driver.quit()

content = BeautifulSoup(page_response, 'html.parser')
table = content.find_all('table')
print(table)
chitown88 answered Mar 08 '23 23:03


The table is generated by JavaScript, but the page source contains the JSON data for that table.

To get the data, you can use BeautifulSoup and json:

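import json

import requests
from bs4 import BeautifulSoup

url = "https://www.greatschools.org/pennsylvania/bethlehem/schools/?tableView=Overview&view=table"

page_response = requests.get(url)
content = BeautifulSoup(page_response.text, "html.parser")
scripts = content.find_all('script')
jsonObj = None
for script in scripts:
    # The table data is assigned to gon.search inside an inline script
    if 'gon.search=' in script.text:
        # Take the text between 'gon.search=' and the next ';'
        jsonStr = script.text.split('gon.search=')[1].split(';')
        jsonObj = json.loads(jsonStr[0])

for school in jsonObj['schools']:
    print(school['name'])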

Or, using re and json:

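import json
import re
import requests

url = "https://www.greatschools.org/pennsylvania/bethlehem/schools/?tableView=Overview&view=table"
page_response = requests.get(url)
jsonStr = re.search(r'gon\.search=(.*?);', page_response.text).group(1)
jsonObj = json.loads(jsonStr)
for school in jsonObj['schools']:
    print(school['name'])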
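
Cutting the text at the first ';' works only as long as no value inside the JSON contains a semicolon. A slightly more defensive variant, sketched below, lets json.JSONDecoder.raw_decode read exactly one JSON value starting right after the assignment and ignore whatever follows. The helper name extract_gon_search and the sample snippet are made up for illustration; the sample stands in for page_response.text and is not real site data.

```python
import json
import re


def extract_gon_search(html, marker="gon.search"):
    """Return the object assigned to `marker` in the page's inline JavaScript."""
    # Locate "gon.search=" and allow optional whitespace around the "=".
    match = re.search(re.escape(marker) + r"\s*=\s*", html)
    if match is None:
        raise ValueError(f"{marker!r} assignment not found")
    # raw_decode parses one JSON value starting at the given index and
    # stops there, so semicolons inside strings do not cut the data short.
    obj, _end = json.JSONDecoder().raw_decode(html, match.end())
    return obj


# Made-up snippet standing in for page_response.text.
sample = (
    '<script>gon.search={"schools":[{"name":"A; B School"},'
    '{"name":"C School"}]};gon.other=1;</script>'
)
names = [school["name"] for school in extract_gon_search(sample)["schools"]]
print(names)
```

With the real page, the call would be extract_gon_search(page_response.text), assuming the site still embeds the data as a gon.search assignment.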
ewwink answered Mar 09 '23 00:03