
Scrapy, hash tag on URLs

I'm in the middle of a scraping project using Scrapy.

I realized that Scrapy strips everything from the hash sign (#) to the end of the URL.

Here's the output from the shell:

[s]   request    <GET http://www.domain.com/b?ie=UTF8&node=3006339011&ref_=pe_112320_20310580%5C#/ref=sr_nr_p_8_0?rh=n%3A165796011%2Cn%3A%212334086011%2Cn%3A%212334148011%2Cn%3A3006339011%2Cp_8%3A2229010011&bbn=3006339011&ie=UTF8&qid=1309631658&rnid=598357011>
[s]   response   <200 http://www.domain.com/b?ie=UTF8&node=3006339011&ref_=pe_112320_20310580%5C>

This really affects my scraping: after a couple of hours trying to find out why some item was not being selected, I realized that the HTML served for the long URL differs from the HTML served for the short one. On closer inspection, the content changes in some critical parts.

Is there a way to modify this behavior so Scrapy keeps the whole URL?

Thanks for your feedback and suggestions.

romeroqj asked Jul 07 '11 00:07



2 Answers


This isn't something Scrapy itself can change. The portion following the hash in the URL is the fragment identifier, which is used only by the client (Scrapy here, usually a browser) and is never sent to the server.
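To see the split for yourself, here is a minimal sketch using only Python's standard library. The URL is a shortened version of the one in the question (the query and fragment are trimmed for readability, which is my simplification). The first value is all the server ever receives.

```python
from urllib.parse import urldefrag

# Shortened version of the URL from the question (trimmed for readability).
url = (
    "http://www.domain.com/b?ie=UTF8&node=3006339011"
    "#/ref=sr_nr_p_8_0?rh=n%3A165796011&bbn=3006339011"
)

# urldefrag separates the part sent to the server from the fragment
# that stays on the client.
base, fragment = urldefrag(url)

print("sent to server:", base)
print("client-side only:", fragment)
```

The fragment is still available in your own code, so you can inspect it and decide which follow-up request to make.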

What probably happens when you fetch the page in a browser is that the page includes some JavaScript that looks at the fragment identifier, loads some additional data via AJAX, and updates the page. You'll need to look at what the browser does and see if you can emulate it. Developer tools like Firebug or the Chrome or Safari inspector make this easy.

For example, if you navigate to http://twitter.com/also, you are redirected to http://twitter.com/#!/also. The actual URL loaded by the browser here is just http://twitter.com/, but that page then loads data (http://twitter.com/users/show_for_profile.json?screen_name=also) which is used to generate the page, and is, in this case, just JSON data you could parse yourself. You can see this happen using the Network Inspector in Chrome.
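As a rough sketch of emulating that step, the function below maps a hashbang URL to the JSON URL quoted above. The function name `profile_json_url` is hypothetical, and the endpoint path is simply the one from this example as it looked in 2011, so treat it as an illustration of the technique and not as a working API.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def profile_json_url(page_url: str) -> str:
    """Build the data URL for a hashbang page URL such as http://twitter.com/#!/also.

    Hypothetical helper: the endpoint path is the one quoted in the answer.
    """
    parts = urlsplit(page_url)
    # The fragment looks like "!/also"; strip the leading "!" and "/" characters.
    screen_name = parts.fragment.lstrip("!/")
    if not screen_name:
        raise ValueError("no screen name found in the URL fragment")
    query = urlencode({"screen_name": screen_name})
    # Same scheme and host, but the path and query of the data request.
    return urlunsplit(
        (parts.scheme, parts.netloc, "/users/show_for_profile.json", query, "")
    )


print(profile_json_url("http://twitter.com/#!/also"))
```

In a spider you would request the resulting URL and parse the JSON response directly, skipping the HTML page.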

also answered Oct 05 '22 13:10



Looks like it's not possible. The problem is not in the response but in the request, which drops the fragment from the URL.

The fragment is retrievable from JavaScript as window.location.hash. From there you could send it to the server with Ajax, for example, or encode it and put it into URLs that are then passed through to the server side.
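The "encode it and put it into URLs" idea can be sketched in Python as well. The helper `fragment_to_query` and the parameter name `frag` are hypothetical, and this only helps if the server is one you control and it knows to read that parameter.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def fragment_to_query(url: str, param: str = "frag") -> str:
    """Move the URL fragment into a query parameter so it reaches the server.

    Hypothetical helper; "frag" is an assumed parameter name.
    """
    parts = urlsplit(url)
    if not parts.fragment:
        return url  # nothing to move
    # Keep the existing query parameters and append the percent-encoded fragment.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    pairs.append((param, parts.fragment))
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(pairs), "")
    )


print(fragment_to_query("http://example.com/b?node=1#/ref=x?rh=2"))
```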

See also: "Can I read the hash portion of the URL on my server-side application (PHP, Ruby, Python, etc.)?"

Why do you need the part that is stripped if the server doesn't receive it from the browser anyway? If you are working with Amazon, I haven't seen any problems with such URLs.

warvariuc answered Oct 05 '22 12:10
