 

Incomplete data after scraping a website for Data

I am working on some web scraping using Python and ran into issues extracting table values. For example, I am interested in scraping the ETF values from http://www.etf.com/etfanalytics/etf-finder. Below is a snapshot of the tables I am trying to scrape values from.

Here is the code I am using for the scraping.

#Import packages
import pandas as pd
import requests

#Get website url and get request
etf_list = "http://www.etf.com/etfanalytics/etf-finder"
etf_df = pd.read_html(requests.get(etf_list, headers={'User-agent': 'Mozilla/5.0'}).text)


#Print the scraped data to screen
print(etf_df)

# Output the read data into dataframes
frame = {}
for i in range(len(etf_df)):
    frame[i] = pd.DataFrame(etf_df[i])
    print(frame[i])

I have several issues.

  • The tables only contain 20 entries, while each table on the website should have 2166 entries. How do I amend the code to pull all the values?
  • Some of the dataframes could not be properly assigned after scraping from the site. For example, the output for frame[0] is not in dataframe format, and nothing shows for frame[0] when viewing it as a DataFrame in the Python console, even though it looks fine when printed to screen. Would it be better if I parsed the HTML using BeautifulSoup instead?

ETF table

SmokingGun asked Dec 12 '25 18:12

1 Answer

As noted by Alex, the website requests the data from http://www.etf.com/etf-finder-funds-api//-aum/0/3000/1, which checks the Referer header to see if you're allowed to see it.

However, Alex is wrong in saying that you're unable to change the header.
It is in fact very easy to send custom headers using requests:

>>> r = requests.get('http://www.etf.com/etf-finder-funds-api//-aum/0/3000/1', headers={'Referer': 'http://www.etf.com/etfanalytics/etf-finder'})
>>> data = r.json()
>>> len(data)
2166

At this point, data contains all the records you need, and pandas has a simple way of loading it into a dataframe.
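A minimal sketch of that last step, assuming the JSON decodes to a list of flat records (the field names below are hypothetical stand-ins for whatever the API actually returns):

```python
import pandas as pd

# Hypothetical records mimicking the shape of the API's JSON;
# the real field names and values will differ.
data = [
    {"fund": "SPDR S&P 500 ETF Trust", "ticker": "SPY", "aum": 264.1},
    {"fund": "iShares Core S&P 500 ETF", "ticker": "IVV", "aum": 144.4},
]

# A list of flat dicts converts directly into a DataFrame,
# with one row per record and one column per key.
df = pd.DataFrame(data)
print(df.shape)  # (2, 3)
```

If the records turn out to be nested, `pd.json_normalize(data)` flattens them into columns instead.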

stranac answered Dec 14 '25 10:12