I'm using PHP to scrape a website and collect some data. It's all done without using regex. I'm using php's explode() method to find particular HTML tags instead.
It is possible that if the structure of the website changes (CSS, HTML), then wrong data may be collected by the scraper. So the question is - how do I know if the HTML structure has changed? How to identify this before storing any data to my database to avoid wrong data being stored.
In order to check whether the website supports web scraping, you should append “/robots. txt” to the end of the URL of the website you are targeting. In such a case, you have to check on that special site dedicated to web scraping. Always be aware of copyright and read up on fair use.
You can't programmatically determine if a page is being scraped. But the same is possible statistically if your scraper becomes popular or you use it too heavily. You can make an educated guess if you see one IP grab the same page or pages at the same time every day.
In short, the action of web scraping isn't illegal. However there are some rules that need to be followed. Web scraping becomes illegal when non publicly available data becomes extracted.
Scraped content, even from high quality sources, without additional useful services or content provided by your site may not provide added value to users. It may also constitute copyright infringement. A site may also be demoted if a significant number of valid legal removal requests have been received.
I think you don't have any clean solutions if you are scraping a page where content changes.
I have developed several python scrapers and I know how can be frustrating when site just makes a subtle change on its layout.
You could try a solution a la mechanize (don't know the php counterpart) and if you are lucky you could isolate the content you need to extract (links?).
Another possibile approach would be to code some constraints and check them before store to db.
For example, if you are scraping Urls, you will need to verify that what scraper has parsed is formally a valid Url; same for integer ID or whatever you want to scrape that can be recognized as valid.
If you are scraping plain text, it will be more difficult to check.
If you want to know changes with respect to structure, I think the best way is to store the DOM structure of your first page and then compare it with new one.
There are lot of way you can do it:- SaxParser DOmParser etc
I have a small blog which will give some pointers to what I mean http://let-them-c.blogspot.com/2009/04/xml-as-objects-in-oops.html
or you can use http://en.wikipedia.org/wiki/Simple_API_for_XML or DOm Utility parser.
Depends on the site but you could count the number of page elements in the scraped page like div, class & style tags then by comparing these totals against those of later scrapes detect if the page structure has been changed.
A similiar process could be used for the CSS file where the names of each each class or id could be extracted using simple regex, stored and checked as needed. If this list has new additions then the page structure has almost certainly changed somewhere on the site being scraped.
Speaking out of my ass here, but its possible you might want to look at some Document Object Model PHP methods.
http://php.net/manual/en/book.dom.php
If my very, very limited understanding of DOM is correct, a change in HTML site structure would change the Document Object Model, but a simple content change within a fixed structure wouldn't. So, if you could capture the DOM state, and then compare it at each scrape, couldn't you in theory determine that such a change has been made?
(By the way, the way I did this when I was trying to get an email notification when the bar exam results were posted on a particular page was just compare file_get_contents() values. Surprisingly, worked flawlessly: No false positives, and emailed me as soon as the site posted the content.)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With