
Finding the web address of a Javascript link

Tags: javascript, r

I am using RCurl in R to try and download data from a website, but I'm having trouble finding out what URL to use. Here is the site:

http://www.invescopowershares.com/products/holdings.aspx?ticker=PGX

See how in the upper right, above the displayed sheet, there's a link to download the data as a .csv file? I was wondering if there was a way to find a regular HTTP address for that .csv file, because RCurl can't handle the Javascript commands.

Andrew Page, asked Aug 06 '10 18:08





1 Answer

Here is a quick and dirty way to get the data. First, use Fiddler2 (http://www.fiddler2.com/fiddler2/) to inspect the POST that your browser sends when you click the download link. That shows the following request:

POST http://www.invescopowershares.com/products/holdings.aspx?ticker=PGX HTTP/1.1
Host: www.invescopowershares.com
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:13.0) Gecko/20100101 Firefox/13.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip, deflate
DNT: 1
Connection: keep-alive
Referer: http://www.invescopowershares.com/products/holdings.aspx?ticker=PGX
Content-Type: application/x-www-form-urlencoded
Content-Length: 70669

__EVENTTARGET=ctl00%24MainPageLeft%24MainPageContent%24ExportHoldings1%24LinkButton1&__EVENTARGUMENT=&__VIEWSTATE=%2FwEPDwUKLTE1OTcxNjYzNw9kFgJmD2QWBAIDD2QWBAIDD2QWCAIBDw9kFgQeC2........

So we can see that three parameters matter here: __EVENTTARGET, __EVENTVALIDATION and __VIEWSTATE. (__EVENTARGUMENT is also posted, but it is empty.)
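To check which fields a captured body contains, it can be decoded in base R. This is a minimal sketch, assuming a shortened stand-in for the body above (the real one is about 70 KB, and it is truncated in the capture); parse_form_body is a hypothetical helper name, not an RCurl function:

```r
# Split an application/x-www-form-urlencoded body into a named character vector.
parse_form_body <- function(body) {
  pairs <- strsplit(body, "&", fixed = TRUE)[[1]]
  # Split each pair at the first "=" only
  eq <- as.integer(regexpr("=", pairs, fixed = TRUE))
  keys <- ifelse(eq > 0, substr(pairs, 1, eq - 1), pairs)
  vals <- ifelse(eq > 0, substr(pairs, eq + 1, nchar(pairs)), "")
  # "+" encodes a space in form bodies; URLdecode() handles the %XX escapes
  decode <- function(x) {
    vapply(gsub("+", " ", x, fixed = TRUE), utils::URLdecode,
           character(1), USE.NAMES = FALSE)
  }
  out <- decode(vals)
  names(out) <- decode(keys)
  out
}

# Shortened stand-in for the captured body (assumption: only the first
# few characters of __VIEWSTATE are kept)
body <- paste0(
  "__EVENTTARGET=ctl00%24MainPageLeft%24MainPageContent%24ExportHoldings1%24LinkButton1",
  "&__EVENTARGUMENT=",
  "&__VIEWSTATE=%2FwEPDwUK"
)
params <- parse_form_body(body)
print(names(params))
```

Decoding also makes it clear that %24 in the captured __EVENTTARGET is just an escaped "$", which is why the R code can use the plain control name.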

The required form for the postForm call would be:

postForm(ftarget, "form name" = "aspnetForm", "method" = "POST", "action" = "holdings.aspx?ticker=PGX", "id" = "aspnetForm","__EVENTTARGET"=event.target,"__EVENTVALIDATION"=event.val,"__VIEWSTATE"=view.state)
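For comparison, here is a hedged sketch of the same call with the fields collected in a list and passed through postForm's .params argument. With style = "POST", RCurl sends the body as application/x-www-form-urlencoded, which matches the Content-Type in the captured request. build_postback and post_holdings are hypothetical helper names, and whether the server accepts the request without the extra "form name", "method", "action" and "id" fields is an untested assumption:

```r
# Assemble the ASP.NET postback fields as a named list.
build_postback <- function(event.target, view.state, event.val) {
  list(
    "__EVENTTARGET"     = event.target,
    "__EVENTARGUMENT"   = "",   # posted empty in the captured request
    "__VIEWSTATE"       = view.state,
    "__EVENTVALIDATION" = event.val
  )
}

# Hypothetical wrapper; calling it needs the RCurl package and network access.
post_holdings <- function(ftarget, fields) {
  RCurl::postForm(ftarget, .params = fields, style = "POST")
}

# Dummy values stand in for the real __VIEWSTATE and __EVENTVALIDATION strings
fields <- build_postback(
  "ctl00$MainPageLeft$MainPageContent$ExportHoldings1$LinkButton1",
  "dummy-view-state",
  "dummy-event-validation"
)
print(names(fields))
```

Keeping the field list separate from the request makes it easy to inspect what will be sent before touching the network.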

Now comes the quick and dirty bit. I would just open a browser and grab the relevant parameters that it receives, as follows:

library(rcom)   # COM bridge used to drive Internet Explorer (Windows only)
library(RCurl)  # provides postForm()
ie <- comCreateObject('InternetExplorer.Application')
ie[["visible"]] <- TRUE  # show the browser window; handy for debugging
ie$Navigate2("http://www.invescopowershares.com/products/holdings.aspx?ticker=PGX")
# Wait until the page has finished loading (ReadyState 4 means complete)
while (comGetProperty(ie, "busy") || comGetProperty(ie, "ReadyState") < 4) {
  Sys.sleep(1)
  print(comGetProperty(ie, "ReadyState"))
}
myDoc <- comGetProperty(ie, "Document")
myPW <- comGetProperty(myDoc, "parentWindow")
# Copy the hidden form fields into global JavaScript variables so COM can read them
comInvoke(myPW, "execScript", "var dumVar1=theForm.__EVENTVALIDATION.value;var dumVar2=theForm.__VIEWSTATE.value;", "JavaScript")
event.val <- myPW[["dumVar1"]]
view.state <- myPW[["dumVar2"]]
event.target <- "ctl00$MainPageLeft$MainPageContent$ExportHoldings1$LinkButton1"
ie$Quit()
ftarget <- "http://www.invescopowershares.com/products/holdings.aspx?ticker=PGX"
# Replay the postback that the download link triggers
web.data <- postForm(ftarget, "form name" = "aspnetForm", "method" = "POST", "action" = "holdings.aspx?ticker=PGX", "id" = "aspnetForm", "__EVENTTARGET" = event.target, "__EVENTVALIDATION" = event.val, "__VIEWSTATE" = view.state)
write(web.data[1], 'temp.csv')
fin.data <- read.csv('temp.csv')


> fin.data[1,]
  ticker SecurityNum                      Name CouponRate maturitydate
1    PGX   949746879 WELLS FARGO & COMPANY PFD       0.08             
     rating  Shares PercentageOfFund PositionDate
1 BBB+/Baa3 2538656       0.04442112   06/11/2012

__EVENTVALIDATION and __VIEWSTATE may always be the same, or they may be session-specific, like cookies. You could probably get them using RCurl, but as I say this is the quick and dirty solution, so we just take the ones that Internet Explorer is given. Things to note:

1). This requires Windows with IE installed, because of the rcom bits.

2). If you are running IE9 you may need to add invescopowershares.com to the Compatibility View Settings (Microsoft seems to have blocked COM calls of the form event.val<-myPW[["dumVar1"]]).

EDIT (UPDATE)

Having looked through the website in more detail, __EVENTVALIDATION and __VIEWSTATE are set in the HTML of the initial page, as the value attributes of hidden form fields. We can just parse these out in a quick and dirty fashion, as follows, without resorting to calling a browser.

library(RCurl)  # provides getURL() and postForm()
# Fetch the initial page, which carries the hidden form fields
dum <- getURL("http://www.invescopowershares.com/products/holdings.aspx?ticker=PGX")
event.target <- "ctl00$MainPageLeft$MainPageContent$ExportHoldings1$LinkButton1"
# Take the text between 'value="' and the closing of each <input> tag
event.val <- unlist(strsplit(dum, "__EVENTVALIDATION\" value=\""))[2]
event.val <- unlist(strsplit(event.val, "\" />\r\n\r\n<script"))[1]
view.state <- unlist(strsplit(dum, "id=\"__VIEWSTATE\" value=\""))[2]
view.state <- unlist(strsplit(view.state, "\" />\r\n\r\n\r\n<script"))[1]
ftarget <- "http://www.invescopowershares.com/products/holdings.aspx?ticker=PGX"
# Replay the postback that the download link triggers
web.data <- postForm(ftarget, "form name" = "aspnetForm", "method" = "POST", "action" = "holdings.aspx?ticker=PGX", "id" = "aspnetForm", "__EVENTTARGET" = event.target, "__EVENTVALIDATION" = event.val, "__VIEWSTATE" = view.state)
write(web.data[1], 'temp.csv')
fin.data <- read.csv('temp.csv')

The above should work cross-platform.
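The strsplit calls depend on the exact whitespace that follows each field, so they break if the page layout shifts slightly. A regular expression is a little more tolerant. This is only a sketch: hidden_value is a hypothetical helper, the sample HTML and its values are made up for illustration, and it assumes the id attribute comes before value inside the same tag, as in the strings the code above splits on:

```r
# Return the value attribute of the <input> whose id is `field`,
# or NA if no such field is found.
hidden_value <- function(html, field) {
  # [^>]* keeps the match inside a single tag
  pattern <- paste0('id="', field, '"[^>]*value="([^"]*)"')
  m <- regmatches(html, regexec(pattern, html))[[1]]
  if (length(m) < 2) NA_character_ else m[2]
}

# Made-up stand-in for the page source returned by getURL()
sample_html <- paste0(
  '<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwUK" />\r\n',
  '<input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="/wEWAgL" />'
)

view.state <- hidden_value(sample_html, "__VIEWSTATE")
event.val  <- hidden_value(sample_html, "__EVENTVALIDATION")
print(c(view.state, event.val))
```

With the real page, dum would take the place of sample_html. An NA result signals that the field was not found, which is easier to detect than a silently wrong split.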

shhhhimhuntingrabbits, answered Oct 07 '22 12:10
