
IMDB Scraping issue [duplicate]

Tags: php, curl

Possible Duplicates:
Does IMDB provide an API?
How to send a header using a HTTP request through a curl call?

I am using PHP curl to scrape movie details from IMDB. Fetching the data works fine, but I have run into a problem with non-English movies like this movie.

When I open this movie in my browser, IMDB shows me the English version of the page, with the movie name "Boarding School". When I fetch the same page through curl, I get the original-language page, where the movie name is "Leidenschaftliche Blümchen".

How can I fetch the English version of the IMDB page with curl?

pravat231 asked Oct 24 '22 03:10


1 Answer

When you request a page with a browser, the browser sends specific request headers to the server. A Firefox extension like Firebug can show them (check the Net panel). As an example, these are the headers Firefox just sent to the server for me:

GET /title/tt0076306/ HTTP/1.1
Host: www.imdb.com
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.8,de-de;q=0.5,de;q=0.3
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Connection: keep-alive
...

The one that probably makes the difference:

Accept-Language: en-us,en;q=0.8,de-de;q=0.5,de;q=0.3

See RFC 2616, section 14.4 (Accept-Language).

When you use curl, it sends request headers as well, but they may differ from the ones your browser sends. However, you can tell curl to send the headers you specify.
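To check which headers curl actually sends, its verbose mode (-v) prints the outgoing request with a "> " prefix on each line. The following is only a minimal sketch, assuming a POSIX shell with sed available; the helper name request_headers is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: keep only the outgoing request lines from
# curl's verbose trace, which curl marks with a leading "> ".
request_headers() {
    sed -n 's/^> //p'
}

# Example (needs network access): print the headers curl sends.
# curl -sv -o /dev/null http://www.imdb.com/title/tt0076306/ 2>&1 | request_headers
```

Comparing that output with the browser's headers should show whether Accept-Language is missing or different.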

If you make curl send the same headers your browser sends, you should get the same result. See How to send a header using a HTTP request through a curl call?.

For example, to get the German version of the page:

curl -H "Accept-Language: de-de;q=0.8,de;q=0.5" http://www.imdb.com/title/tt0076306/

For the English version:

curl -H "Accept-Language: en-us,en;q=0.8,de-de;q=0.5,de;q=0.3" http://www.imdb.com/title/tt0076306/
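The two commands above differ only in the header value, so they could be wrapped in a small shell function. This is a rough sketch, not a tested scraper: the function names accept_language and fetch_localized are hypothetical, and the English fallback with q=0.5 is an assumption chosen for illustration.

```shell
#!/bin/sh
# Hypothetical helper: build an Accept-Language header line that prefers
# the given language tag and falls back to English (assumed weight 0.5).
accept_language() {
    # $1 is a language tag such as "en-us" or "de-de"
    printf 'Accept-Language: %s,en;q=0.5' "$1"
}

# Hypothetical helper: fetch a URL while asking for the preferred language.
# -s silences the progress meter, -L follows redirects, -H adds the header.
fetch_localized() {
    # $1 = language tag, $2 = URL
    curl -s -L -H "$(accept_language "$1")" "$2"
}

# Example (needs network access):
# fetch_localized en-us http://www.imdb.com/title/tt0076306/
```

From PHP, the same header line can be passed to the curl extension through the CURLOPT_HTTPHEADER option, which takes an array of header strings.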

hakre answered Oct 27 '22 11:10