Can Javascript read the source of any web page?

I am working on screen scraping and want to retrieve the source code of a particular page.

How can I achieve this with JavaScript? Please help me.

praveenjayapal asked Mar 25 '09 07:03



1 Answer

A simple way to start is jQuery's .load(), which fetches a page and inserts the fragment matching a selector into an element:

$("#links").load("/Main_Page #jq-p-Getting-Started li"); 

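More in the jQuery docs. Note that .load() is an Ajax call and is subject to the browser's same-origin policy, so on its own it only retrieves pages from your own domain.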

A much more structured way to do screen scraping is YQL (Yahoo Query Language), which returns the scraped data as JSON or XML.
For example, let's scrape stackoverflow.com:

select * from html where url="http://stackoverflow.com" 

will give you JSON (the output format I chose) like this:

"results": {
  "body": {
    "noscript": [
      {
        "div": {
          "id": "noscript-padding"
        }
      },
      {
        "div": {
          "id": "noscript-warning",
          "p": "Stack Overflow works best with JavaScript enabled"
        }
      }
    ],
    "div": [
      {
        "id": "notify-container"
      },
      {
        "div": [
          {
            "id": "header",
            "div": [
              {
                "id": "hlogo",
                "a": {
                  "href": "/",
                  "img": {
                    "alt": "logo homepage",
                    "height": "70",
                    "src": "http://i.stackoverflow.com/Content/Img/stackoverflow-logo-250.png",
                    "width": "250"
                  } ……..

The beauty of this is that you can use projections and where clauses, so you get back structured data containing only what you need (and send much less over the wire).
For example:

select * from html where url="http://stackoverflow.com" and
      xpath='//div/h3/a'

will get you:

"results": {
  "a": [
    {
      "href": "/questions/414690/iphone-simulator-port-for-windows-closed",
      "title": "Duplicate: Is any Windows simulator available to test iPhone application? as a hobbyist who cannot afford a mac, i set up a toolchain kit locally on cygwin to compile objecti … ",
      "content": "iphone\n                simulator port for windows [closed]"
    },
    {
      "href": "/questions/680867/how-to-redirect-the-web-page-in-flex-application",
      "title": "I have a button control ....i need another web page to be redirected while clicking that button .... how to do that ? Thanks ",
      "content": "How\n                to redirect the web page in flex application ?"
    },
    …..

Now, to get only the question titles, we select just the title attribute:

select title from html where url="http://stackoverflow.com" and
      xpath='//div/h3/a'

Note that the projection is now title instead of *, so each result carries only that field:

"results": {
  "a": [
    {
      "title": "I don't want the function to be entered simultaneously by multiple threads, neither do I want it to be entered again when it has not returned yet. Is there any approach to achieve … "
    },
    {
      "title": "I'm certain I'm doing something really obviously stupid, but I've been trying to figure it out for a few hours now and nothing is jumping out at me. I'm using a ModelForm so I can … "
    },
    {
      "title": "when i am going through my project in IE only its showing errors A runtime error has occurred Do you wish to debug? Line 768 Error:Expected')' Is this is regarding any script er … "
    },
    {
      "title": "I have a java batch file consisting of 4 execution steps written for analyzing any Java application. In one of the steps, I'm adding few libs in classpath that are needed for my co … "
    },
    { ……

Once you have written your query, YQL generates a URL for it:

http://query.yahooapis.com/v1/public/yql?q=select%20title%20from%20html%20where%20url%3D%22http%3A%2F%2Fstackoverflow.com%22%20and%0A%20%20%20%20%20%20xpath%3D'%2F%2Fdiv%2Fh3%2Fa'%0A%20%20%20%20&format=json&callback=cbfunc

in our case.
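
If you would rather build that URL in code than copy it from YQL, a minimal sketch could look like the following. The endpoint and the q, format and callback parameters are taken from the URL above; the helper name buildYqlUrl is hypothetical and not part of YQL or jQuery.

```javascript
// Hypothetical helper (not part of YQL or jQuery): builds a query URL
// using the endpoint and parameter names seen in the generated URL above.
function buildYqlUrl(query, callbackName) {
  var base = "http://query.yahooapis.com/v1/public/yql";
  // encodeURIComponent escapes spaces, quotes, ":" and "/" in the query text.
  var url = base + "?q=" + encodeURIComponent(query) + "&format=json";
  if (callbackName) {
    // Only add the JSONP callback parameter when a name is supplied.
    url += "&callback=" + encodeURIComponent(callbackName);
  }
  return url;
}

var query = "select title from html where url=\"http://stackoverflow.com\" and xpath='//div/h3/a'";
var theAboveUrl = buildYqlUrl(query, "cbfunc");
```

The result differs from the generated URL only in whitespace, because the query string here is written on a single line.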

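So ultimately you end up doing something like this:

$.getJSON(theAboveUrl, function (data) { /* data is the parsed response; the titles are under "results" */ });

and play with the data inside the callback. $.getJSON does not return the data itself; it is asynchronous and hands the response to the callback. For a cross-domain JSONP request with jQuery, use callback=? in place of callback=cbfunc so that jQuery supplies the callback name.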
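
As a rough sketch of what you might then do with the response, the function below pulls the titles out of a "results" object shaped like the sample output shown earlier. The name extractTitles and the sample data are illustrative assumptions; where "results" sits inside the full response is not shown in this answer, so the function takes that object directly.

```javascript
// Hypothetical helper: collects the "title" of every entry under results.a,
// matching the shape of the sample output shown earlier.
function extractTitles(results) {
  if (!results || !results.a) {
    return []; // nothing matched, or an unexpected shape
  }
  // Defensive assumption: accept a single object as well as an array.
  var anchors = Array.isArray(results.a) ? results.a : [results.a];
  return anchors.map(function (anchor) {
    return anchor.title;
  });
}

// Illustrative data only, shortened from the sample output.
var sampleResults = {
  a: [
    { title: "I don't want the function to be entered simultaneously by multiple threads" },
    { title: "I'm certain I'm doing something really obviously stupid" }
  ]
};

var titleList = extractTitles(sampleResults);
titleList.forEach(function (title) {
  console.log(title);
});
```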

Beautiful, isn’t it?

Cherian answered Oct 18 '22 07:10
