I am working on a number of websites with files dating back to 2000. These sites have grown organically over time, resulting in large numbers of orphaned web pages, include files, images, CSS files, JavaScript files, etc. These orphaned files cause a number of problems, including poor maintainability, possible security holes, a poor customer experience, and driving OCD/GTD freaks like myself crazy.
These files number in the thousands, so a completely manual solution is not feasible. Ultimately, the cleanup process will require a fairly large QA effort to ensure we have not inadvertently deleted needed files, but I am hoping to develop a technological solution to speed up the manual effort. Additionally, I hope to put processes/utilities in place to prevent this state of disorganization from happening again in the future.
Environment Considerations:
Before I start, I would like to get some feedback from others who have successfully navigated a similar process.
Specifically, I am looking for:
I am not looking for:
At first I thought you could get away with scanning files for links and then diffing against your folder structure, but this only identifies simple orphans, not clusters of orphaned files that reference each other (and therefore look like they are in use). So, using grep probably won't get you all the way there.
This isn't a trivial solution, but it would make an excellent utility for keeping your environment clean (and is therefore worth the effort). Plus, you can reuse it across all environments (and share it with others!)
The basic idea is to set up and populate a directed graph where each node's key is an absolute path. This is done by scanning all the files and adding their dependencies as edges - for example:
/index.html -> /subfolder/file.jpg
-> /subfolder/temp.html
-> /error.html
/temp.html -> /index.html
/error.html
/stray.html -> /index.html
/abandoned.html
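Here is a minimal Python sketch of the scanning step. The SITE_ROOT path, the file-extension filter, and the regex for href/src attributes are all assumptions for illustration; a real scanner would use an HTML parser and also pick up CSS url(...) references:

import os
import re
from collections import defaultdict

SITE_ROOT = "/var/www/site"  # hypothetical document root - adjust to your site

# Naive matcher for href="..." / src="..." attributes; good enough for a sketch.
LINK_RE = re.compile(r'(?:href|src)\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

def build_graph(root):
    """Scan every file under root; map each file to the files it references."""
    graph = defaultdict(set)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            graph.setdefault(path, set())  # isolated files still get a node
            if not name.lower().endswith((".html", ".htm", ".css", ".js")):
                continue  # only these file types are scanned for links
            with open(path, encoding="utf-8", errors="ignore") as f:
                text = f.read()
            for target in LINK_RE.findall(text):
                if target.startswith(("http:", "https:", "mailto:", "#")):
                    continue  # skip external and in-page links
                if target.startswith("/"):  # site-root-relative link
                    resolved = os.path.normpath(os.path.join(root, target.lstrip("/")))
                else:  # document-relative link
                    resolved = os.path.normpath(os.path.join(dirpath, target))
                graph[path].add(resolved)
    return graph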
Then, you can identify all your "reachable" files by doing a BFS from your root page.
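Continuing the sketch above, the BFS might look like this. In practice you would seed it with every legitimate entry point (the home page, error pages, landing pages linked only from external campaigns), not just a single index.html:

from collections import deque

def reachable_from(graph, start):
    """Breadth-first search: everything transitively linked from start."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):  # .get avoids mutating a defaultdict
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

graph = build_graph(SITE_ROOT)
orphan_candidates = set(graph) - reachable_from(graph, SITE_ROOT + "/index.html")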
With the directed graph, you can also classify files by their in-degree and out-degree. In the example above:
/index.html in: 1 out: 2
/temp.html in: 1 out: 1
/error.html in: 1 out: 0
/stray.html in: 0 out: 1
/abandoned.html in: 0 out: 0
So, you're basically looking for files that have in = 0; those are your abandoned candidates.
Additionally, files that have out = 0 are going to be terminal pages, which may or may not be desirable on your site (as the name suggests, /error.html is an error page).
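A small helper on top of the same graph structure can produce this classification; again, this is just a sketch under the assumptions above:

def degrees(graph):
    """Return {path: (in_degree, out_degree)} for every node in the graph."""
    in_deg = {node: 0 for node in graph}
    for targets in graph.values():
        for target in targets:  # a target may not exist on disk (broken link)
            in_deg[target] = in_deg.get(target, 0) + 1
    all_nodes = set(graph) | set(in_deg)
    return {node: (in_deg.get(node, 0), len(graph.get(node, ())))
            for node in all_nodes}

for path, (i, o) in sorted(degrees(graph).items()):
    flag = "ORPHAN?" if i == 0 else ("terminal" if o == 0 else "")
    print(f"{path}  in: {i}  out: {o}  {flag}")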