How to know if the website being scraped has changed?

I'm using PHP to scrape a website and collect some data. It's all done without regex; instead, I use PHP's explode() function to find particular HTML tags.

If the structure of the website (CSS, HTML) changes, the scraper may collect wrong data. So the question is: how do I know whether the HTML structure has changed? How can I detect this before storing any data in my database, so that wrong data never gets stored?

asked Mar 27 '10 17:03

Yeti

4 Answers

I don't think there is a clean solution when you are scraping a page whose content changes.

I have developed several Python scrapers and I know how frustrating it can be when a site makes a subtle change to its layout.

You could try a mechanize-style solution (I don't know the PHP counterpart); if you are lucky, you can isolate just the content you need to extract (links?).

Another possible approach would be to code some constraints and check them before storing anything in the db.

For example, if you are scraping URLs, verify that what the scraper has parsed is a well-formed URL; do the same for integer IDs or anything else you scrape that has a recognizable valid form.
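The constraint check described above might look roughly like the following. This is only a sketch: the function name isValidRecord() and the field names 'url' and 'id' are made up for illustration, and the real rules depend on what you scrape.

```php
<?php
// Hypothetical validator for one scraped record; the field names
// ('url', 'id') are illustrative and not taken from the question.
function isValidRecord(array $record): bool
{
    // The URL must be syntactically valid and use http or https.
    $url = $record['url'] ?? '';
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }
    $scheme = parse_url($url, PHP_URL_SCHEME);
    if (!in_array(strtolower((string) $scheme), ['http', 'https'], true)) {
        return false;
    }

    // The ID must be a non-empty string made of digits only.
    $id = (string) ($record['id'] ?? '');
    return $id !== '' && ctype_digit($id);
}

$scraped = ['url' => 'http://example.com/item/42', 'id' => '42'];
if (!isValidRecord($scraped)) {
    // Skip the insert and flag the page for manual inspection.
    error_log('Scraped record failed validation; the layout may have changed.');
}
```

If a layout change makes explode() return a chunk of markup instead of a URL, the record is rejected before it reaches the database.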

If you are scraping plain text, it will be more difficult to check.

answered Oct 23 '22 18:10

systempuntoout

If you want to detect changes in structure, I think the best way is to store the DOM structure of the page from your first scrape and then compare it with the new one.

There are a lot of ways to do it: a SAX parser, a DOM parser, etc.

I have a small blog post that gives some pointers on what I mean: http://let-them-c.blogspot.com/2009/04/xml-as-objects-in-oops.html

Or you can use SAX (http://en.wikipedia.org/wiki/Simple_API_for_XML) or a DOM utility parser.
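One possible way to "store the DOM structure" in PHP is to reduce the page to the list of its element paths and hash that list. The sketch below assumes PHP's bundled DOM extension; structureFingerprint() is a hypothetical helper name, not something from the blog post above.

```php
<?php
// Build a fingerprint of the page's element structure, ignoring text content.
// structureFingerprint() is a hypothetical helper, not a library function.
function structureFingerprint(string $html): string
{
    if (trim($html) === '') {
        return hash('sha256', ''); // nothing to parse
    }

    $previous = libxml_use_internal_errors(true); // tolerate sloppy real-world HTML
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $paths = [];
    // getElementsByTagName('*') returns every element in document order.
    foreach ($doc->getElementsByTagName('*') as $element) {
        $paths[] = $element->getNodePath(); // e.g. /html/body/div/p
    }
    return hash('sha256', implode("\n", $paths));
}

$first = structureFingerprint('<html><body><div><p>Price: 10</p></div></body></html>');
$later = structureFingerprint('<html><body><div><p>Price: 25</p></div></body></html>');

if ($first !== $later) {
    error_log('Page structure changed; not storing scraped data.');
}
```

Only element names and nesting go into the hash, so a new price leaves the fingerprint alone, while swapping the div for a table changes it. Pages with a varying number of repeated items would need a looser comparison.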

answered Oct 23 '22 19:10

Kapil D

It depends on the site, but you could count the page elements in the scraped page, such as div tags, class attributes and style tags, and then compare these totals against those of later scrapes to detect whether the page structure has changed.
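A rough sketch of the counting idea, assuming PHP's DOM extension is available. tagCounts() is a hypothetical helper, and the two HTML strings stand in for an earlier and a later scrape.

```php
<?php
// Count how often each tag name occurs in a page.
// tagCounts() is a hypothetical helper name.
function tagCounts(string $html): array
{
    $previous = libxml_use_internal_errors(true); // tolerate sloppy HTML
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $names = [];
    foreach ($doc->getElementsByTagName('*') as $element) {
        $names[] = $element->nodeName;
    }
    $counts = array_count_values($names);
    ksort($counts); // stable key order so two results can be compared directly
    return $counts;
}

$baseline = tagCounts('<html><body><div><p>a</p><p>b</p></div></body></html>');
$current  = tagCounts('<html><body><div><p>a</p></div><span>b</span></body></html>');

if ($baseline !== $current) {
    error_log('Tag totals differ from the baseline; check the page layout.');
}
```

As the answer says, this depends on the site: a page whose number of rows or items varies from day to day will change its totals without any real layout change.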

A similar process could be used for the CSS file, where the name of each class or id could be extracted using a simple regex, stored, and checked as needed. If this list has new additions, then the page structure has almost certainly changed somewhere on the site being scraped.
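For the CSS part, a simple regex along these lines might be enough. It is a heuristic and not a CSS parser: cssSelectorNames() is a hypothetical helper, and the sketch assumes a flat stylesheet (rules nested inside @media blocks are not handled).

```php
<?php
// Extract class and id selector names from CSS text with a simple regex.
// Heuristic only: assumes flat rules, no nesting such as @media blocks.
function cssSelectorNames(string $css): array
{
    // Drop declaration blocks so that values like #f00 are not matched.
    $selectorsOnly = preg_replace('/\{[^}]*\}/', ' ', $css);
    preg_match_all('/[.#]-?[A-Za-z_][\w-]*/', (string) $selectorsOnly, $matches);

    $names = array_unique($matches[0]);
    sort($names);
    return $names;
}

$old = cssSelectorNames(".price { color: #f00; }\n#main .title { margin: 0; }");
$new = cssSelectorNames(
    ".price { color: #f00; }\n#main .title { margin: 0; }\n.promo-banner { display: block; }"
);

// Names present now that were not in the stored list.
$added = array_values(array_diff($new, $old));
if ($added !== []) {
    error_log('New CSS selectors: ' . implode(', ', $added));
}
```

The stored list would normally come from a file or database row written on an earlier run; the inline strings here only keep the example self-contained.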

answered Oct 23 '22 19:10

Skizz

Speaking out of my ass here, but it's possible you might want to look at some of PHP's Document Object Model methods.

http://php.net/manual/en/book.dom.php

If my very, very limited understanding of DOM is correct, a change in HTML site structure would change the Document Object Model, but a simple content change within a fixed structure wouldn't. So, if you could capture the DOM state, and then compare it at each scrape, couldn't you in theory determine that such a change has been made?

(By the way, when I was trying to get an email notification as soon as the bar exam results were posted on a particular page, I just compared file_get_contents() values. Surprisingly, it worked flawlessly: no false positives, and it emailed me as soon as the site posted the content.)
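A minimal sketch of that whole-page comparison. It keeps a SHA-256 hash of the previous response instead of the full body, which is a small variation on what is described above; hasChanged() and the state file path are assumptions, and the URL and email address are placeholders.

```php
<?php
// Report whether a page body differs from the last one seen.
// $stateFile is a hypothetical path where the previous hash is kept.
function hasChanged(string $body, string $stateFile): bool
{
    $newHash = hash('sha256', $body);
    $oldHash = is_file($stateFile)
        ? trim((string) file_get_contents($stateFile))
        : null;
    file_put_contents($stateFile, $newHash);

    // The first run has nothing to compare against, so it reports no change.
    return $oldHash !== null && $oldHash !== $newHash;
}

// Usage (URL and recipient are placeholders):
// $body = file_get_contents('http://example.com/results');
// if ($body !== false && hasChanged($body, __DIR__ . '/last.hash')) {
//     mail('me@example.com', 'Page changed', 'The watched page has new content.');
// }
```

This flags any byte-level difference, content included, so it suits "tell me when anything changes" rather than "tell me when the layout changes". Pages that embed timestamps or rotating ads would trigger it on every request.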

answered Oct 23 '22 17:10

phphelpplz

Tags: php, web-scraping, screen-scraping