I'm looking for some robust, well-documented PHP web crawler scripts, perhaps a PHP port of the Java project Apache Nutch: http://wiki.apache.org/nutch/NutchTutorial
I'm open to both free and non-free options.
A web crawler is a program that traverses sites on the Web and discovers URLs. Search engines normally use a crawler to find URLs on the Web. Google's original crawler was written in Python, and other search engines use different kinds of crawlers. For web crawling we have to perform the following steps:
Web scraping lets you collect data from web pages across the internet. The term is often used interchangeably with web crawling or web data extraction, although crawling strictly means discovering pages by following links. PHP is a widely used back-end scripting language for creating dynamic websites and web applications, and you can implement a web scraper in plain PHP.
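As a rough illustration of the "plain PHP" approach, here is a minimal sketch that pulls link targets out of an HTML string using PHP's built-in DOM extension. The function name extractLinks is made up for this example, and it assumes the HTML has already been downloaded (for instance with file_get_contents or cURL); it does not resolve relative URLs.

```php
<?php
// Hypothetical helper (not from any library): return the unique href values
// of all <a> elements found in an HTML string, in document order.
function extractLinks(string $html): array
{
    // DOMDocument::loadHTML() rejects an empty string, so bail out early.
    if (trim($html) === '') {
        return [];
    }

    $doc = new DOMDocument();

    // Real-world HTML is often malformed; collect parser warnings
    // internally instead of printing them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href !== '') {
            $links[] = $href;
        }
    }

    // Drop duplicates and reindex from 0.
    return array_values(array_unique($links));
}

// Usage with an inline sample page (no network access needed).
$sample = '<html><body><a href="/a">A</a> <a href="/b">B</a> <a href="/a">A again</a></body></html>';
print_r(extractLinks($sample));
```

For anything beyond a quick script you would probably also want to resolve relative links against the page URL and filter out mailto: or javascript: targets.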
Here are the basic steps to build a crawler:
Step 1: Add one or several URLs to the list of URLs to be visited.
Step 2: Pop a link from the URLs to be visited and add it to the list of visited URLs.
Step 3: Fetch the page's content and scrape the data you're interested in with the ScrapingBot API.
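The queue-and-visited-list loop described in these steps could be sketched roughly as follows. This is only an illustration: the function name crawl, the $fetchLinks callback, and the $maxPages limit are assumptions made for the example. The callback stands in for the fetch-and-scrape step (it is not the ScrapingBot API), so the loop can be tried against an in-memory "site" without touching the network.

```php
<?php
// Hypothetical sketch of the crawl loop: $fetchLinks is any callable that
// takes a URL and returns an array of URLs found on that page.
function crawl(array $seeds, callable $fetchLinks, int $maxPages = 100): array
{
    // Step 1: seed the queue of URLs to be visited.
    $queue = new SplQueue();
    foreach ($seeds as $seed) {
        $queue->enqueue($seed);
    }

    $visited = []; // URL => true, used as a set

    while (!$queue->isEmpty() && count($visited) < $maxPages) {
        // Step 2: pop a URL and record it as visited.
        $url = $queue->dequeue();
        if (isset($visited[$url])) {
            continue; // already handled via another path
        }
        $visited[$url] = true;

        // Step 3: fetch the page and queue any links not seen yet.
        foreach ($fetchLinks($url) as $link) {
            if (!isset($visited[$link])) {
                $queue->enqueue($link);
            }
        }
    }

    // URLs in the order they were visited.
    return array_keys($visited);
}

// Usage with a fake in-memory site instead of real HTTP requests.
$site = [
    'https://example.com/'  => ['https://example.com/a', 'https://example.com/b'],
    'https://example.com/a' => ['https://example.com/', 'https://example.com/c'],
    'https://example.com/b' => [],
    'https://example.com/c' => [],
];
$order = crawl(['https://example.com/'], fn(string $url): array => $site[$url] ?? []);
print_r($order);
```

In a real crawler the callback would download the page and extract its links, and you would likely add politeness delays, robots.txt handling, and a restriction to the target domain.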
Just give Snoopy a try.
Excerpt: "Snoopy is a PHP class that simulates a web browser. It automates the task of retrieving web page content and posting forms, for example."
Goutte (https://github.com/fabpot/Goutte) is also a good library, and it is compatible with the PSR-0 standard.