Python - web scraping data table that covers multiple urls

I'm very new to Python, but am really trying to learn it. I was playing around with scraping data from a website and feel like I am very close to a solution. The issue is that the script keeps returning only the first page of results, even though the URL in the code changes the page number on each iteration.

The website I am using is http://etfdb.com/etf/SPY/#etf-holdings&sort_name=weight&sort_order=desc&page=1 and the specific data table I am trying to scrape is SPY Holdings (where it says 506 holdings and then lists Apple, Microsoft, etc.).

As you will notice, the data table has a number of pages (this changes based on the ticker symbol; there happen to be 34 pages for SPY, but it won't always be 34). It begins by showing 15 companies, and when you click 2 (to see the next 15 holdings), the page= value in the URL goes up by one.

#to break up html
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
import csv
import math

#goes to url - determines the number of holdings and the number of pages the data table will need to loop through
my_url = "http://etfdb.com/etf/SPY/#etf-
holdings&sort_name=weight&sort_order=desc&page=1"
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html,"html.parser")
#scrapes the holdings count from another section of the page (506 for SPY)
num_holdings_text = page_soup.find('span',{'class': 'relative-metric-bubble-data'})
num_holdings = int(num_holdings_text.text)
#the table shows 15 holdings at a time, so this calcs the number of pages to loop through
num_of_loops = math.ceil(num_holdings / 15)
holdings = []
for loop in range(1,num_of_loops+1):
    my_url = "http://etfdb.com/etf/SPY/#etf-holdings&sort_name=weight&sort_order=desc&page=" + str(loop)
    uClient = uReq(my_url)
    page_html = uClient.read()
    uClient.close()
    page_soup = soup(page_html, "html.parser")
    table = page_soup.find('table', {
        'class': 'table mm-mobile-table table-module2 table-default table-striped table-hover table-pagination'})
    table_body = table.find('tbody')
    table_rows = table_body.find_all('tr')
    for tr in table_rows:
        td = tr.find_all('td')
        row = [i.text.strip() for i in td]
        holdings.append(row)
        print(row)

#write all scraped rows out once the loop is done
with open('etfdatapull2.csv','w',newline='') as fp:
    a = csv.writer(fp, delimiter = ',')
    a.writerows(holdings)

Again, the issue I am having is that it continually returns the first page (e.g. it always returns Apple through GE), even though the link is updating.

Thank you so much for your help. Again, very new to this so please dumb it down as much as possible!

asked May 11 '26 by Steve Butler


1 Answer

The issue is that the site you are trying to scrape actually loads the data through JavaScript after the initial page load. If you use something like the Chrome Developer Tools, you can see that on page 2 the site requests the following link:

http://etfdb.com/data_set/?tm=1699&cond={by_etf:325}&no_null_sort=true&count_by_id=&sort=weight&order=desc&limit=15&offset=15

The data you are looking for is there; your logic is sound but you just need to scrape the link above.
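For example, your existing loop could request that endpoint instead of the #etf-holdings URL, stepping the offset parameter by 15 per page. Here's a rough sketch (the tm and cond values are the ones Developer Tools shows for SPY, so treat them as assumptions that will differ for other tickers):

from urllib.request import urlopen as uReq

for page in range(1, 35):  #34 pages for SPY
    offset = (page - 1) * 15  #page 1 -> offset 0, page 2 -> offset 15, ...
    data_url = ("http://etfdb.com/data_set/?tm=1699&cond={by_etf:325}"
                "&no_null_sort=true&count_by_id=&sort=weight&order=desc"
                "&limit=15&offset=" + str(offset))
    uClient = uReq(data_url)
    raw = uClient.read().decode("utf-8")
    uClient.close()
    #parse raw here the same way you were parsing page_html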

If you remove the "offset" parameter, and change the limit to 1000, you'll actually get all of the data at once, and you can remove the pagination altogether.
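In code, that collapses the whole loop into a single request. Again just a sketch; I'm assuming the endpoint accepts limit=1000, and you'll want to inspect the response before parsing it, since it may come back as JSON rather than plain HTML:

from urllib.request import urlopen as uReq

#no offset, and a limit well past the 506 holdings -> everything in one response
data_url = ("http://etfdb.com/data_set/?tm=1699&cond={by_etf:325}"
            "&no_null_sort=true&count_by_id=&sort=weight&order=desc&limit=1000")
uClient = uReq(data_url)
raw = uClient.read().decode("utf-8")
uClient.close()
print(raw)  #check what comes back before feeding it to BeautifulSoup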

Hope that helps!

EDIT: I should have pointed out that the page you are loading is always the same (the first set of entries, starting with AAPL); the data is then loaded by JavaScript from the resource above, which replaces the contents of the HTML you were scraping. Since your script only sees the original HTML (it does not execute the JavaScript that replaces the contents), you get the same table over and over.

answered May 13 '26 by meisen99


