I'm trying to scrape a table from the NYSE website (http://www1.nyse.com/about/listed/IPO_Index.html) into a pandas dataframe. In order to do so, I have a setup like this:
import requests
import pandas
from bs4 import BeautifulSoup

def htmltodf(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.text)
    tables = soup.findAll('table')
    test = pandas.io.html.read_html(str(tables))
    return test  # a list of DataFrame objects, one per table found
However, when I run this on the page, all of the tables returned in the list are essentially empty. When I investigated further, I found that the table is generated by JavaScript. When using the developer tools in my web browser, I see that the table looks like any other HTML table with the usual tags. However, a view of the page source revealed something like this instead:
<script language="JavaScript">
.
.
.
<script>
var year = [["ICC","21st Century Oncology Holdings, Inc.","22 May 2014","/about/listed/icc.html" ],
... more entries here ...
,["ZOES","Zoe's Kitchen, Inc.","11 Apr 2014","/about/listed/zoes.html" ]] ;
if(year.length != 0)
{
document.write ("<table width='619' border='0' cellspacing='0' cellpadding='0'><tr><td><span class='fontbold'>");
document.write ('2014' + " IPO Showcase");
document.write ("</span></td></tr></table>");
}
</script>
Therefore, when my HTML parser goes to look for the table tag, all it can find is the if condition, with no proper tags below it that would indicate content. How can I scrape this table? Is there a tag I can search for instead of table that will reveal the content? And because the code is not in traditional HTML table form, how do I read it in with pandas? Do I have to manually parse the data?
In this case, you need something to run that JavaScript code for you. One option here would be to use selenium:
from pandas.io.html import read_html
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get('http://www1.nyse.com/about/listed/IPO_Index.html')

# Select the parent of the nested table that holds the IPO listing,
# then parse its rendered HTML with pandas
table = driver.find_element(By.XPATH, '//div[@class="sp5"]/table//table/..')
table_html = table.get_attribute('innerHTML')

df = read_html(table_html)[0]
print(df)
driver.close()
prints:
0 1 2 3
0 Name Symbol NaT NaN
1 Performance Sports Group Ltd. PSG 2014-06-20 NaN
2 Century Communities, Inc. CCS 2014-06-18 NaN
3 Foresight Energy Partners LP FELP 2014-06-18 NaN
...
79 EGShares TCW EM Long Term Investment Grade Bon... LEMF 2014-01-08 NaN
80 EGShares TCW EM Short Term Investment Grade Bo... SEMF 2014-01-08 NaN
[81 rows x 4 columns]
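Alternatively, since the data is already embedded in the page source as a JavaScript array literal, you can skip the browser entirely: fetch the page with requests and pull the array out with a regular expression. Here is a minimal sketch, assuming the `var year = [[...]] ;` layout shown in the question (the variable name and field order are taken from that snippet, and the column labels are my own guesses):

```python
import json
import re

import pandas as pd

# Stand-in for requests.get(url).text; this is the structure shown in the
# question's page source.
script_text = '''
var year = [["ICC","21st Century Oncology Holdings, Inc.","22 May 2014","/about/listed/icc.html" ],
["ZOES","Zoe's Kitchen, Inc.","11 Apr 2014","/about/listed/zoes.html" ]] ;
'''

# Capture the [[...]] array literal; since it uses double-quoted strings,
# it happens to be valid JSON and can be parsed directly.
match = re.search(r'var year\s*=\s*(\[\[.*?\]\])\s*;', script_text, re.DOTALL)
rows = json.loads(match.group(1))

# Column labels are assumptions based on the sample data, not from the site.
df = pd.DataFrame(rows, columns=['Symbol', 'Name', 'Date', 'Link'])
print(df)
```

This avoids the Selenium dependency, but it is more brittle: it only works as long as the page keeps embedding the data in that exact `var year = ...` form.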