
PHP Site Scraping With a Secure Login

Tags: php

I am trying to scrape the quantity of items one of my distributors has in stock, per product. They do not know how to export this data, so could someone point me in the right direction on how to scrape a site with PHP when you have to log in to get to the data? It's not a secure site with SSL.

Thanks for any tips,

Chris Edwards

Asked Dec 01 '10 by Chris Edwards




2 Answers

The easiest way to get where you want is to use cURL. Its core feature is letting you make an HTTP request configured exactly how you need it and receive the response. This can be done in varying degrees of detail, depending on your needs.

What you want to do is basically make an HTTP request to get the page you want and scrape the data out of the response's HTML. This can be very easy to do, but in your case you will need to overcome some obstacles.
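For illustration, here is a minimal sketch of such a request; the URL is a placeholder for whatever page you actually want:

<?php
// Minimal sketch: fetch a page with cURL and get the HTML back as a string.
// The URL is a placeholder; substitute your distributor's page.
$ch = curl_init('http://distributor.example.com/products');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
$html = curl_exec($ch);

if ($html === false) {
    die('cURL error: ' . curl_error($ch));
}
curl_close($ch);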

I'm assuming that by saying "have to log in" you mean there's a login form you have to get past before being able to scrape anything. cURL can fake a login with a little help on your part.

First of all, you will need to "submit" the login form with cURL just as you would by hand. To make sure you get it right, you will need to see the HTTP requests your browser makes when submitting the form by hand and construct identical requests with cURL. To see the HTTP requests in detail, you can use Firebug, Chrome's Developer Tools, or the absolutely fantastic Fiddler debugging proxy.
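A sketch of such a faked form submission, assuming the action URL and the field names ("username", "password") for illustration; copy the real ones from the form's HTML or from the captured browser request:

<?php
// Sketch: fake the login form POST. The field names and the action URL
// are assumptions; take the real ones from the captured browser request.
$ch = curl_init('http://distributor.example.com/login.php');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query([
    'username' => 'your_user',
    'password' => 'your_pass',
]));
$response = curl_exec($ch);
curl_close($ch);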

Most probably, after submitting a valid login form the server will send you a cookie to be used to authenticate you on subsequent requests. This cookie will be part of the headers of the server's HTTP response (the Set-Cookie header). You will need to remember the value of that cookie and include a Cookie header on subsequent requests to the server -- in essence you are doing exactly what your browser would if you were logged in¹.
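cURL can manage this cookie for you. A sketch, with the same placeholder URLs as above: CURLOPT_COOKIEJAR saves any Set-Cookie values to a file, and CURLOPT_COOKIEFILE sends them back on later requests made with the same handle.

<?php
// Sketch: have cURL persist the session cookie between requests.
$cookieFile = tempnam(sys_get_temp_dir(), 'scrape_cookies');

$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);  // save cookies the server sets
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile); // send saved cookies back

// ...perform the login POST from the previous sketch on $ch...

// A later request on the same handle carries the session cookie automatically.
curl_setopt($ch, CURLOPT_URL, 'http://distributor.example.com/stock.php');
curl_setopt($ch, CURLOPT_HTTPGET, true); // switch back to GET after the POST
$stockHtml = curl_exec($ch);
curl_close($ch);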

And finally, you may need to make more than one round-trip to find your target. Maybe the URL you need to scrape isn't known beforehand, and you need to scrape a "list" page to find out some variable part of the URL you want to scrape. This can be solved by simply tackling the problem in steps: first scrape the "list" page, find out what you need, then scrape the "details" page you really want.
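A sketch of that two-step approach: fetchPage() below is a hypothetical helper wrapping the cookie-aware cURL calls from the earlier sketches, and the URLs and XPath expression are guesses at the distributor's markup.

<?php
// Sketch: scrape the "list" page, pull out the detail URLs, then fetch those.
// fetchPage() is a hypothetical helper around the cURL calls shown earlier.
$listHtml = fetchPage('http://distributor.example.com/products');

$doc = new DOMDocument();
@$doc->loadHTML($listHtml); // @ silences warnings on sloppy real-world HTML
$xpath = new DOMXPath($doc);

// The XPath is an assumption; adjust it to the list page's actual markup.
foreach ($xpath->query('//a[contains(@href, "product_detail")]') as $link) {
    $detailHtml = fetchPage('http://distributor.example.com/' . $link->getAttribute('href'));
    // ...extract the quantity in stock from $detailHtml...
}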

I'm keeping the code to the minimal sketches above, as there are tons of cURL tutorials on the web, but I believe that knowing what the plan is will make your work much, much easier.


¹ Another (faster but cruder) way to go about this is to simply log in yourself, see the value of the cookie you got, and paste that into your scrape's request. The upside is that you no longer need to fake a login with cURL; the downside is that each time your tool is to be used, someone has to log in manually first and provide your tool with the fresh cookie value.
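That crude variant might look like this; the cookie name (PHPSESSID is only a common default) and value are made up, so copy the real pair out of your browser's developer tools after logging in:

<?php
// Sketch: skip the login POST and reuse a cookie copied from a browser.
// The cookie name and value are placeholders; paste your real ones.
$ch = curl_init('http://distributor.example.com/stock.php');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_COOKIE, 'PHPSESSID=paste_your_session_id_here');
$html = curl_exec($ch);
curl_close($ch);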

Answered Sep 28 '22 by Jon


There is a library called cURL; you should look into it:

link

It allows your script to log in, use cookies/sessions, and scrape the content from any of the pages it follows; you can set how deep it should go, whether it should follow any redirects, etc. You could even use it to post data. It's a great tool, basically.
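For example, a sketch of the redirect options just mentioned (the URL is a placeholder):

<?php
// Sketch: follow redirects (common right after a login POST), with a cap.
$ch = curl_init('http://distributor.example.com/login.php');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow Location: redirects
curl_setopt($ch, CURLOPT_MAXREDIRS, 5);         // stop after five hops
$html = curl_exec($ch);
curl_close($ch);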

Here is also a link to a tutorial that shows step by step how it works:

http://devzone.zend.com/article/1081

Answered Sep 28 '22 by Breezer