I am trying to scrape http://www.dailyfinance.com/quote/NYSE/international-business-machines/IBM/financial-ratios, but the traditional url string building technique doesn't work because the "full-company-name-is-inserted-in-the-path" string. And the exact "full-company-name" isn't known in advance. Only the company symbol, "IBM" is known. Essentially, the way I scrape is by looping through an array of company symbol and build the url string before sending it to urllib2.urlopen(url). But in this case, that can't be done. For example, CSCO string is <pre class="prettyprint"><code>http://www.dailyfinance.com/quote/NASDAQ/cisco-systems-inc/CSCO/financial-ratios </code></pre> and another example url string is AAPL: <pre class="prettyprint"><code>http://www.dailyfinance.com/quote/NASDAQ/apple/AAPL/financial-ratios </code></pre> So in order to get the url, I had to search the symbol in the input box on the main page: <pre class="prettyprint"><code>http://www.dailyfinance.com/ </code></pre> I've noticed that when I type "CSCO" and inspect the search input at (http://www.dailyfinance.com/quote/NASDAQ/apple/AAPL/financial-ratios in Firefox web developer network tab, I noticed that the get request is sending to <pre class="prettyprint"><code>http://j.foolcdn.com/tmf/predictivesearch?callback=_predictiveSearch_csco&term=csco&domain=dailyfinance.com </code></pre> and that the referer actually gives the path that I want to capture <pre class="prettyprint"><code>Host: j.foolcdn.com User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:28.0) Gecko/20100101 Firefox/28.0 Accept: */* Accept-Language: en-US,en;q=0.5 Accept-Encoding: gzip, deflate Referer: http://www.dailyfinance.com/quote/NASDAQ/cisco-systems-inc/CSCO/financial-ratios?source=itxwebtxt0000007 Connection: keep-alive </code></pre> Sorry for the long explanation. So the question is how do I extract the url in the Referer? If that is not possible, how should I approach this problem? Is there another way? I really appreciate your help.

I like this question. And because of that, I'll give a very thorough answer. For this, I'll use my favorite Requests library along with BeautifulSoup4. Porting over to Mechanize if you really want to use that is up to you. Requests will save you tons of headaches though. <hr> First off, you're probably looking for a POST request. However, POST requests are often not needed if a search function brings you right away to the page you're looking for. So let's inspect it, shall we? When I land on the base URL, <code>http://www.dailyfinance.com/</code>, I can do a simple check via Firebug or Chrome's inspect tool that when I put in CSCO or AAPL on the search bar and enable the "jump", there's a <code>301 Moved Permanently</code> status code. What does this mean? <img src="https://i.stack.imgur.com/nPXcC.png" alt="enter image description here"> In simple terms, I was transferred somewhere. The URL for this GET request is the following: <pre class="prettyprint"><code>http://www.dailyfinance.com/quote/jump?exchange-input=&ticker-input=CSCO </code></pre> Now, we test if it works with AAPL by using a simple URL manipulation. <pre class="prettyprint"><code>import requests as rq apl_tick = "AAPL" url = "http://www.dailyfinance.com/quote/jump?exchange-input=&ticker-input=" r = rq.get(url + apl_tick) print r.url </code></pre> The above gives the following result: <pre class="prettyprint"><code>http://www.dailyfinance.com/quote/nasdaq/apple/aapl [Finished in 2.3s] </code></pre> See how the URL of the response changed? Let's take the URL manipulation one step further by looking for the <code>/financial-ratios</code> page by appending the below to the above code: <pre class="prettyprint"><code>new_url = r.url + "/financial-ratios" p = rq.get(new_url) print p.url </code></pre> When ran, this gives is the following result: <pre class="prettyprint"><code>http://www.dailyfinance.com/quote/nasdaq/apple/aapl http://www.dailyfinance.com/quote/nasdaq/apple/aapl/financial-ratios [Finished in 6.0s] </code></pre> Now we're on the right track. I will now try to parse the data using BeautifulSoup. My complete code is as follows: <pre class="prettyprint"><code>from bs4 import BeautifulSoup as bsoup import requests as rq apl_tick = "AAPL" url = "http://www.dailyfinance.com/quote/jump?exchange-input=&ticker-input=" r = rq.get(url + apl_tick) new_url = r.url + "/financial-ratios" p = rq.get(new_url) soup = bsoup(p.content) div = soup.find("div", id="clear").table rows = table.find_all("tr") for row in rows: print row </code></pre> I then try running this code, only to encounter an error with the following traceback: <pre class="prettyprint"><code> File "C:\Users\nanashi\Desktop\test.py", line 13, in <module> div = soup.find("div", id="clear").table AttributeError: 'NoneType' object has no attribute 'table' </code></pre> Of note is the line <code>'NoneType' object...</code>. This means our target <code>div</code> does not exist! Egads, but why am I seeing the following?! <img src="https://i.stack.imgur.com/lGI8O.png" alt="enter image description here"> There can only be one explanation: the table is loaded dynamically! Rats. Let's see if we can find another source for the table. I study the page and see that there are scrollbars at the bottom. This might mean that the table was loaded inside a frame or was loaded straight from another source entirely and placed into a <code>div</code> in the page. I refresh the page and watch the GET requests again. Bingo, I found something that seems a bit promising: <img src="https://i.stack.imgur.com/T4us5.png" alt="enter image description here"> A third-party source URL, and look, it's easily manipulable using the ticker symbol! Let's try loading it into a new tab. Here's what we get: <img src="https://i.stack.imgur.com/XFA3z.png" alt="enter image description here"> WOW! We now have the very exact source of our data. The last hurdle though is will it work when we try to pull the CSCO data using this string (remember we went CSCO -> AAPL and now back to CSCO again, so you're not confused). Let's clean up the string and ditch the role of <code>www.dailyfinance.com</code> here completely. Our new url is as follows: <pre class="prettyprint"><code>http://www.motleyfool.idmanagedsolutions.com/stocks/financial_ratios.idms?SYMBOL_US=AAPL </code></pre> Let's try using that in our final scraper! <pre class="prettyprint"><code>from bs4 import BeautifulSoup as bsoup import requests as rq csco_tick = "CSCO" url = "http://www.motleyfool.idmanagedsolutions.com/stocks/financial_ratios.idms?SYMBOL_US=" new_url = url + csco_tick r = rq.get(new_url) soup = bsoup(r.content) table = soup.find("div", id="clear").table rows = table.find_all("tr") for row in rows: print row.get_text() </code></pre> And our raw results for CSCO's financial ratios data is as follows: <pre class="prettyprint"><code>Company Industry Valuation Ratios P/E Ratio (TTM) 15.40 14.80 P/E High - Last 5 Yrs 24.00 28.90 P/E Low - Last 5 Yrs 8.40 12.10 Beta 1.37 1.50 Price to Sales (TTM) 2.51 2.59 Price to Book (MRQ) 2.14 2.17 Price to Tangible Book (MRQ) 4.25 3.83 Price to Cash Flow (TTM) 11.40 11.60 Price to Free Cash Flow (TTM) 28.20 60.20 Dividends Dividend Yield (%) 3.30 2.50 Dividend Yield - 5 Yr Avg (%) N.A. 1.20 Dividend 5 Yr Growth Rate (%) N.A. 144.07 Payout Ratio (TTM) 45.00 32.00 Sales (MRQ) vs Qtr 1 Yr Ago (%) -7.80 -3.70 Sales (TTM) vs TTM 1 Yr Ago (%) 5.50 5.60 Growth Rates (%) Sales - 5 Yr Growth Rate (%) 5.51 5.12 EPS (MRQ) vs Qtr 1 Yr Ago (%) -54.50 -51.90 EPS (TTM) vs TTM 1 Yr Ago (%) -54.50 -51.90 EPS - 5 Yr Growth Rate (%) 8.91 9.04 Capital Spending - 5 Yr Growth Rate (%) 20.30 20.94 Financial Strength Quick Ratio (MRQ) 2.40 2.70 Current Ratio (MRQ) 2.60 2.90 LT Debt to Equity (MRQ) 0.22 0.20 Total Debt to Equity (MRQ) 0.31 0.25 Interest Coverage (TTM) 18.90 19.10 Profitability Ratios (%) Gross Margin (TTM) 63.20 62.50 Gross Margin - 5 Yr Avg 66.30 64.00 EBITD Margin (TTM) 26.20 25.00 EBITD - 5 Yr Avg 28.82 0.00 Pre-Tax Margin (TTM) 21.10 20.00 Pre-Tax Margin - 5 Yr Avg 21.60 18.80 Management Effectiveness (%) Net Profit Margin (TTM) 17.10 17.65 Net Profit Margin - 5 Yr Avg 17.90 15.40 Return on Assets (TTM) 8.30 8.90 Return on Assets - 5 Yr Avg 8.90 8.00 Return on Investment (TTM) 11.90 12.30 Return on Investment - 5 Yr Avg 12.50 10.90 Efficiency Revenue/Employee (TTM) 637,890.00 556,027.00 Net Income/Employee (TTM) 108,902.00 98,118.00 Receivable Turnover (TTM) 5.70 5.80 Inventory Turnover (TTM) 11.30 9.70 Asset Turnover (TTM) 0.50 0.50 [Finished in 2.0s] </code></pre> Cleaning up the data is up to you. <hr> One good lesson to learn from this scrape is not all data are contained in one page alone. It's pretty nice to see it coming from another static site. If it was produced via JavaScript or AJAX calls or the like, we would likely have some difficulties with our approach. Hopefully you learned something from this. Let us know if this helps and good luck.

Doesn't answer your specific question, but solves your problem. <code>http://www.dailyfinance.com/quotes/{Company Symbol}/{Stock Exchange}</code> Examples: http://www.dailyfinance.com/quotes/AAPL/NAS http://www.dailyfinance.com/quotes/IBM/NYSE http://www.dailyfinance.com/quotes/CSCO/NAS To get to the financial ratios page you could then employ something like this: <pre class="prettyprint"><code>import urllib2 def financial_ratio_url(symbol, stock_exchange): starturl = 'http://www.dailyfinance.com/quotes/' starturl += '/'.join([symbol, stock_exchange]) req = urllib2.Request(starturl) res = urllib2.urlopen(starturl) return '/'.join([res.geturl(),'financial-ratios']) </code></pre> Example: <pre class="prettyprint"><code>financial_ratio_url('AAPL', 'NAS') 'http://www.dailyfinance.com/quote/nasdaq/apple/aapl/financial-ratios' </code></pre>

How do I scrape pages with dynamically generated URLs using Python?

Tags:

python

beautifulsoup

urllib2

web-scraping

I am trying to scrape http://www.dailyfinance.com/quote/NYSE/international-business-machines/IBM/financial-ratios, but the traditional url string building technique doesn't work because the "full-company-name-is-inserted-in-the-path" string. And the exact "full-company-name" isn't known in advance. Only the company symbol, "IBM" is known.

Essentially, the way I scrape is by looping through an array of company symbol and build the url string before sending it to urllib2.urlopen(url). But in this case, that can't be done.

For example, CSCO string is

http://www.dailyfinance.com/quote/NASDAQ/cisco-systems-inc/CSCO/financial-ratios

and another example url string is AAPL:

http://www.dailyfinance.com/quote/NASDAQ/apple/AAPL/financial-ratios

So in order to get the url, I had to search the symbol in the input box on the main page:

http://www.dailyfinance.com/

I've noticed that when I type "CSCO" and inspect the search input at (http://www.dailyfinance.com/quote/NASDAQ/apple/AAPL/financial-ratios in Firefox web developer network tab, I noticed that the get request is sending to

http://j.foolcdn.com/tmf/predictivesearch?callback=_predictiveSearch_csco&term=csco&domain=dailyfinance.com

and that the referer actually gives the path that I want to capture

Host: j.foolcdn.com
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:28.0) Gecko/20100101 Firefox/28.0
Accept: */*
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Referer: http://www.dailyfinance.com/quote/NASDAQ/cisco-systems-inc/CSCO/financial-ratios?source=itxwebtxt0000007
Connection: keep-alive

Sorry for the long explanation. So the question is how do I extract the url in the Referer? If that is not possible, how should I approach this problem? Is there another way?

I really appreciate your help.

616

asked Apr 25 '14 20:04

vt2424253

2 Answers

I like this question. And because of that, I'll give a very thorough answer. For this, I'll use my favorite Requests library along with BeautifulSoup4. Porting over to Mechanize if you really want to use that is up to you. Requests will save you tons of headaches though.

First off, you're probably looking for a POST request. However, POST requests are often not needed if a search function brings you right away to the page you're looking for. So let's inspect it, shall we?

When I land on the base URL, http://www.dailyfinance.com/, I can do a simple check via Firebug or Chrome's inspect tool that when I put in CSCO or AAPL on the search bar and enable the "jump", there's a 301 Moved Permanently status code. What does this mean?

enter image description here

In simple terms, I was transferred somewhere. The URL for this GET request is the following:

http://www.dailyfinance.com/quote/jump?exchange-input=&ticker-input=CSCO

Now, we test if it works with AAPL by using a simple URL manipulation.

import requests as rq

apl_tick = "AAPL"
url = "http://www.dailyfinance.com/quote/jump?exchange-input=&ticker-input="
r = rq.get(url + apl_tick)
print r.url

The above gives the following result:

http://www.dailyfinance.com/quote/nasdaq/apple/aapl
[Finished in 2.3s]

See how the URL of the response changed? Let's take the URL manipulation one step further by looking for the /financial-ratios page by appending the below to the above code:

new_url = r.url + "/financial-ratios"
p = rq.get(new_url)
print p.url

When ran, this gives is the following result:

http://www.dailyfinance.com/quote/nasdaq/apple/aapl
http://www.dailyfinance.com/quote/nasdaq/apple/aapl/financial-ratios
[Finished in 6.0s]

Now we're on the right track. I will now try to parse the data using BeautifulSoup. My complete code is as follows:

from bs4 import BeautifulSoup as bsoup
import requests as rq

apl_tick = "AAPL"
url = "http://www.dailyfinance.com/quote/jump?exchange-input=&ticker-input="
r = rq.get(url + apl_tick)
new_url = r.url + "/financial-ratios"
p = rq.get(new_url)

soup = bsoup(p.content)
div = soup.find("div", id="clear").table
rows = table.find_all("tr")
for row in rows:
    print row

I then try running this code, only to encounter an error with the following traceback:

  File "C:\Users\nanashi\Desktop\test.py", line 13, in <module>
    div = soup.find("div", id="clear").table
AttributeError: 'NoneType' object has no attribute 'table'

Of note is the line 'NoneType' object.... This means our target div does not exist! Egads, but why am I seeing the following?!

enter image description here

There can only be one explanation: the table is loaded dynamically! Rats. Let's see if we can find another source for the table. I study the page and see that there are scrollbars at the bottom. This might mean that the table was loaded inside a frame or was loaded straight from another source entirely and placed into a div in the page.

I refresh the page and watch the GET requests again. Bingo, I found something that seems a bit promising:

enter image description here

A third-party source URL, and look, it's easily manipulable using the ticker symbol! Let's try loading it into a new tab. Here's what we get:

enter image description here

WOW! We now have the very exact source of our data. The last hurdle though is will it work when we try to pull the CSCO data using this string (remember we went CSCO -> AAPL and now back to CSCO again, so you're not confused). Let's clean up the string and ditch the role of www.dailyfinance.com here completely. Our new url is as follows:

http://www.motleyfool.idmanagedsolutions.com/stocks/financial_ratios.idms?SYMBOL_US=AAPL

Let's try using that in our final scraper!

from bs4 import BeautifulSoup as bsoup
import requests as rq

csco_tick = "CSCO"
url = "http://www.motleyfool.idmanagedsolutions.com/stocks/financial_ratios.idms?SYMBOL_US="
new_url = url + csco_tick

r = rq.get(new_url)
soup = bsoup(r.content)

table = soup.find("div", id="clear").table
rows = table.find_all("tr")
for row in rows:
    print row.get_text()

And our raw results for CSCO's financial ratios data is as follows:

Company
Industry


Valuation Ratios


P/E Ratio (TTM)
15.40
14.80


P/E High - Last 5 Yrs 
24.00
28.90


P/E Low - Last 5 Yrs
8.40
12.10


Beta
1.37
1.50


Price to Sales (TTM)
2.51
2.59


Price to Book (MRQ)
2.14
2.17


Price to Tangible Book (MRQ)
4.25
3.83


Price to Cash Flow (TTM)
11.40
11.60


Price to Free Cash Flow (TTM)
28.20
60.20


Dividends


Dividend Yield (%)
3.30
2.50


Dividend Yield - 5 Yr Avg (%)
N.A.
1.20


Dividend 5 Yr Growth Rate (%)
N.A.
144.07


Payout Ratio (TTM)
45.00
32.00


Sales (MRQ) vs Qtr 1 Yr Ago (%)
-7.80
-3.70


Sales (TTM) vs TTM 1 Yr Ago (%)
5.50
5.60


Growth Rates (%)


Sales - 5 Yr Growth Rate (%)
5.51
5.12


EPS (MRQ) vs Qtr 1 Yr Ago (%)
-54.50
-51.90


EPS (TTM) vs TTM 1 Yr Ago (%)
-54.50
-51.90


EPS - 5 Yr Growth Rate (%)
8.91
9.04


Capital Spending - 5 Yr Growth Rate (%)
20.30
20.94


Financial Strength


Quick Ratio (MRQ)
2.40
2.70


Current Ratio (MRQ)
2.60
2.90


LT Debt to Equity (MRQ)
0.22
0.20


Total Debt to Equity (MRQ)
0.31
0.25


Interest Coverage (TTM)
18.90
19.10


Profitability Ratios (%)


Gross Margin (TTM)
63.20
62.50


Gross Margin - 5 Yr Avg
66.30
64.00


EBITD Margin (TTM)
26.20
25.00


EBITD - 5 Yr Avg
28.82
0.00


Pre-Tax Margin (TTM)
21.10
20.00


Pre-Tax Margin - 5 Yr Avg
21.60
18.80


Management Effectiveness (%)


Net Profit Margin (TTM)
17.10
17.65


Net Profit Margin - 5 Yr Avg
17.90
15.40


Return on Assets (TTM)
8.30
8.90


Return on Assets - 5 Yr Avg
8.90
8.00


Return on Investment (TTM)
11.90
12.30


Return on Investment - 5 Yr Avg
12.50
10.90


Efficiency


Revenue/Employee (TTM)
637,890.00
556,027.00


Net Income/Employee (TTM)
108,902.00
98,118.00


Receivable Turnover (TTM)
5.70
5.80


Inventory Turnover (TTM)
11.30
9.70


Asset Turnover (TTM)
0.50
0.50

[Finished in 2.0s]

Cleaning up the data is up to you.

One good lesson to learn from this scrape is not all data are contained in one page alone. It's pretty nice to see it coming from another static site. If it was produced via JavaScript or AJAX calls or the like, we would likely have some difficulties with our approach.

Hopefully you learned something from this. Let us know if this helps and good luck.

answered Oct 19 '22 15:10

NullDev

Doesn't answer your specific question, but solves your problem.

http://www.dailyfinance.com/quotes/{Company Symbol}/{Stock Exchange}

Examples:

http://www.dailyfinance.com/quotes/AAPL/NAS

http://www.dailyfinance.com/quotes/IBM/NYSE

http://www.dailyfinance.com/quotes/CSCO/NAS

To get to the financial ratios page you could then employ something like this:

import urllib2

def financial_ratio_url(symbol, stock_exchange):
    starturl  = 'http://www.dailyfinance.com/quotes/'
    starturl += '/'.join([symbol, stock_exchange])
    req = urllib2.Request(starturl)
    res = urllib2.urlopen(starturl)
    return '/'.join([res.geturl(),'financial-ratios'])

Example:

financial_ratio_url('AAPL', 'NAS')
'http://www.dailyfinance.com/quote/nasdaq/apple/aapl/financial-ratios'

answered Oct 19 '22 17:10

cdhagmann

Related questions
                            
                                How can I debug POST requests with python's BaseHTTPServer / SimpleHTTPServer?
                            
                                Efficient scheduling of university courses
                            
                                Bottle.py HTTP Auth?
                            
                                Why is recursion in python so slow?
                            
                                How can I use an app-factory in Flask / WSGI servers and why might it be unsafe?
                            
                                API in Flask--returns JSON but HTML exceptions break my JSON client
                            
                                Django: How can I check the last activity time of user if user didn't log out?
                            
                                AttributeError: 'module' object has no attribute
                            
                                Assigning column names from a list to a table
                            
                                convert a 2D numpy array to a 2D numpy matrix
                            
                                PyImport_Import fails (returns NULL)
                            
                                How can we get tweets from specific country
                            
                                Linear regression with pandas dataframe
                            
                                matplotlib plot set x_ticks
                            
                                Case Insensitive Python string split() method
                            
                                Change python mro at runtime
                            
                                How can I call super() so it's compatible in 2 and 3?
                            
                                Finding index of maximum value in array with NumPy
                            
                                Check element type in BeautifulSoup 3
                            
                                Convert list of strings to dictionary

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How do I scrape pages with dynamically generated URLs using Python?

Tags:

python

beautifulsoup

urllib2

web-scraping

vt2424253

People also ask

2 Answers

NullDev

cdhagmann

Recent Activity

Donate For Us