
Python 3, Web-scraping, and Javascript [Oh My]

I have come to the point of entering the melee on scraping webpages that use Javascript, with Python 3. I am well aware that my boot may be making contact with a dead horse, but I feel like drawing my six-shooter anyway. It's a spaghetti western; be my gray hat?

::Backstory::

I am using Python 3.2.3.

I am interested in gathering historical stock//etf//mutual_fund price data for YTD, 1-yr, 3-yr, 5-yr, 10-yr... and/or similar timeframes for a user-defined stock, etf, or mutual fund. I set my sights on Morningstar.com, as they tend to provide as much data as possible without necessarily requiring a log-in; other folks such as finance.google.com &c tend to be inconsistent in what data they provide regarding stocks vs etfs vs mutual funds.

The trade-off in using Morningstar for this historical data, or "Trailing Total Returns" as they call it, is that for producing this data they use Javascript.

Here are some example links from Morningstar:

A Mutual Fund;

An ETF;

A Stock.

I am interested in the "Trailing Returns" portion, top row or so of numbers in the Javascript-produced chart.

::Attempted So Far::

I've confirmed that wget doesn't play with Javascript; even downloading all of the associated files [css, .js, &c] hasn't allowed me to locally render the javascript in browser or in script. Research here on StackOverflow confirmed this. Am willing to be corrected here.

My research informed me that Mechanize doesn't exist for Python3. I tried anyway, and turned into Policeman Javert crying out "I knew it!" at the error message "module does not exist".

::I've Heard Of...::

->Selenium. However, my understanding is that this requires Thy Favorite Browser to actually open up a webpage, navigate around, and then not close because there's no "close this tab//window" command//option for Selenium. What if I//my_user want to get historical data for many etfs, stocks, and/or mutual funds? That's a lot of tabs//windows opening up in a browser which was not necessarily desired to be opened.

->httplib2. I think this is nice, but I'm doubtful if it will play with Javascript. Does it, for example using the .cache and get options?

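import httplib2
# ".cache" is the directory httplib2 uses for its on-disk HTTP cache
conn = httplib2.Http(".cache")
# request() returns a (response, content) tuple; the u"" prefix is dropped because it is a syntax error on Python 3.2
response, content = conn.request("http://the_url", "GET")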

->Windmill. See 'Selenium'. I am, however, off-key enough to sing 'Man of La Mancha'.

->Google's webscraping code. Would an attempt at downloading a Javascript-laden page result in ... positive results?

I've read chatter about having to "emulate a browser without a browser". Sounds like Mechanize, but not for Python3 as I currently understand.

::My Question::

Any suggestions, pointers, solutions, or "look over here" directions?

Many thanks,

Miles, Dusty Desert Villager.

MilesNielsen asked Dec 21 '22 18:12


1 Answer

When a page loads data via Javascript, it has to request that data from the server, typically through the XMLHttpRequest (XHR) API. You can see which requests the page is making, and then make them yourself, using wget!

To find out which requests they are making, use the Web Inspector (Chrome and Safari) or Firebug (Firefox). Here's how to do it in Chrome:

Wrench menu > Tools > Developer Tools > Network (tab at the top of the tools) > XHR filter at the bottom.

Here's an example request they make in javascript

If you look closely at the XHR request url, you notice that all trailing returns have the same format:

http://performance.morningstar.com/Performance/cef/trailing-total-returns.action?t=

You just need to specify t. For example:

http://performance.morningstar.com/Performance/cef/trailing-total-returns.action?t=VAW
http://performance.morningstar.com/Performance/cef/trailing-total-returns.action?t=INTC
http://performance.morningstar.com/Performance/cef/trailing-total-returns.action?t=VHCOX

Now you can wget those URIs and parse out the data directly.
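Since the question is about Python 3, the same request can also be made from a script with only the standard library. What follows is a minimal sketch, not a tested scraper: the base URL is the one shown above, while the function names `trailing_returns_url` and `fetch_trailing_returns` are made up for illustration. I'm assuming the endpoint still answers a plain GET, and I'm not assuming anything about the format of what comes back, so the sketch returns the raw text and leaves parsing to you.

```python
import urllib.parse
import urllib.request

# Base URL taken from the XHR request observed above.
BASE_URL = ("http://performance.morningstar.com/Performance/cef/"
            "trailing-total-returns.action")


def trailing_returns_url(ticker):
    """Build the request URL for one ticker (hypothetical helper name)."""
    return BASE_URL + "?" + urllib.parse.urlencode({"t": ticker})


def fetch_trailing_returns(ticker, timeout=10):
    """Fetch the raw response body as text (hypothetical helper name).

    Assumption: the server answers a plain GET. The body is returned
    unparsed because its format is not known here.
    """
    with urllib.request.urlopen(trailing_returns_url(ticker),
                                timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

Calling `fetch_trailing_returns("VAW")` in a loop over your tickers would then replace one wget call per symbol, with no browser windows involved.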

Cypress Frankenfeld answered Dec 30 '22 05:12