Scraping a table from a page using beautifulsoup, table is not found

I've been trying to scrape the table from here but it seems to me that BeautifulSoup doesn't find any table.

I wrote:

import requests
import pandas as pd
from bs4 import BeautifulSoup
import csv

url = "http://www.payscale.com/college-salary-report/bachelors?page=65" 
r=requests.get(url)
data=r.text

soup=BeautifulSoup(data,'xml')
table=soup.find_all('table')
print table   #prints nothing..

Based on other similar questions, I assume that the HTML is broken in some way, but I'm not an expert. I couldn't find an answer in these: (Beautiful soup missing some html table tags), (Extracting a table from a website), (Scraping a table using BeautifulSoup), or even (Python+BeautifulSoup: scraping a particular table from a webpage).

Thanks a bunch!

asked Feb 01 '26 by oba2311


2 Answers

You are parsing HTML, but you have told BeautifulSoup to use the XML parser.
You should use soup = BeautifulSoup(data, "html.parser") instead.
Even then, there is no table tag in the downloaded page at all; the data you need is embedded in a script tag, so you have to find the text inside the script.
N.B.: If you are using Python 2.x, use "HTMLParser" instead of "html.parser".

Here is the code.

import csv
import requests
from bs4 import BeautifulSoup

url = "http://www.payscale.com/college-salary-report/bachelors?page=65"
r = requests.get(url)
data = r.text

soup = BeautifulSoup(data, "html.parser")
scripts = soup.find_all("script")

file_name = open("table.csv", "w", newline="")
writer = csv.writer(file_name)
list_to_write = []

list_to_write.append(["Rank", "School Name", "School Type",
                      "Early Career Median Pay", "Mid-Career Median Pay",
                      "% High Job Meaning", "% STEM"])

for script in scripts:
    text = script.text
    start = 0
    end = 0
    # The script tag that embeds the table data is by far the largest one,
    # so a simple length check is enough to pick it out.
    if len(text) > 10000:
        # Walk through the embedded data, pulling out each field in the
        # order it appears for every school record.
        while start > -1:
            start = text.find('"School Name":"', start)
            if start == -1:
                break
            start += len('"School Name":"')
            end = text.find('"', start)
            school_name = text[start:end]

            start = text.find('"Early Career Median Pay":"', start)
            start += len('"Early Career Median Pay":"')
            end = text.find('"', start)
            early_pay = text[start:end]

            start = text.find('"Mid-Career Median Pay":"', start)
            start += len('"Mid-Career Median Pay":"')
            end = text.find('"', start)
            mid_pay = text[start:end]

            start = text.find('"Rank":"', start)
            start += len('"Rank":"')
            end = text.find('"', start)
            rank = text[start:end]

            start = text.find('"% High Job Meaning":"', start)
            start += len('"% High Job Meaning":"')
            end = text.find('"', start)
            high_job = text[start:end]

            start = text.find('"School Type":"', start)
            start += len('"School Type":"')
            end = text.find('"', start)
            school_type = text[start:end]

            start = text.find('"% STEM":"', start)
            start += len('"% STEM":"')
            end = text.find('"', start)
            stem = text[start:end]

            list_to_write.append([rank, school_name, school_type, early_pay,
                                  mid_pay, high_job, stem])

writer.writerows(list_to_write)
file_name.close()

This writes the table you need to table.csv. Don't forget to close the file when you are done.
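Since your original snippet already imports pandas, a quick optional way to sanity-check the output is to load the CSV back into a DataFrame. This is just a sketch, assuming table.csv was written by the script above.

import pandas as pd

# Read the CSV produced above and show the first few rows.
df = pd.read_csv("table.csv")
print(df.head())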

answered Feb 03 '26 by MD. Khairul Basar


This won't find the table, because the table isn't in r.text at all, but you are also asking BeautifulSoup to use the XML parser instead of html.parser, so I would recommend changing that line to:

soup=BeautifulSoup(data,'html.parser')

One of the issues you will run into with web scraping is the difference between "client-rendered" and server-rendered websites. Basically, this means that the page you get from a plain HTTP request, through the requests module or through curl for example, is not the same content that would be rendered in a web browser. Some of the common frameworks for this are React and Angular.

If you examine the source of the page you want to scrape, it has data-reactid attributes on several of its HTML elements. A common tell for Angular pages is similar element attributes with the ng prefix, e.g. ng-if or ng-bind. You can see the page's source in Chrome or Firefox through their respective dev tools, which can be launched with the keyboard shortcut Ctrl+Shift+I in either browser. It's worth noting that not all React and Angular pages are only client-rendered.
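As a rough illustration (not part of the original answer), you can check whether the raw HTML returned by requests carries such framework markers. This sketch assumes the page still emits a data-reactid attribute, which can vary with the React version in use.

import requests
from bs4 import BeautifulSoup

url = "http://www.payscale.com/college-salary-report/bachelors?page=65"
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

# Count elements carrying a data-reactid attribute -- a hint that the page
# is built with React and may be rendered on the client.
react_nodes = soup.find_all(attrs={"data-reactid": True})
print("Elements with data-reactid:", len(react_nodes))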

In order to get this sort of content, you would need to use a headless browser tool like Selenium. There are many resources on web scraping with Selenium and Python.
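Here is a minimal sketch of that approach, assuming you have Chrome and a matching chromedriver installed and on your PATH (exact option names can differ between Selenium versions):

from selenium import webdriver
from bs4 import BeautifulSoup

options = webdriver.ChromeOptions()
options.add_argument("--headless")   # run Chrome without opening a window

driver = webdriver.Chrome(options=options)
try:
    driver.get("http://www.payscale.com/college-salary-report/bachelors?page=65")
    # page_source holds the markup after the browser has run the page's
    # JavaScript, so BeautifulSoup should now be able to see the table.
    soup = BeautifulSoup(driver.page_source, "html.parser")
    tables = soup.find_all("table")
    print("Tables found after rendering:", len(tables))
finally:
    driver.quit()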

answered Feb 03 '26 by metame


