I am working on a number of websites with files dating back to 2000. These sites have grown organically over time, resulting in large numbers of orphaned web pages, include files, images, CSS files, JavaScript files, etc. These orphaned files cause a number of problems, including poor maintainability, possible security holes, a poor customer experience, and driving OCD/GTD freaks like myself crazy.
These files number in the thousands, so a completely manual solution is not feasible. Ultimately, the cleanup process will require a fairly large QA effort to ensure we have not inadvertently deleted needed files, but I am hoping to develop a technological solution to speed up the manual effort. Additionally, I hope to put processes/utilities in place to prevent this state of disorganization from happening again in the future.
Environment Considerations:
Before I start, I would like to get some feedback from others who have successfully navigated a similar process.
Specifically, I am looking for:
I am not looking for:
At first I thought you could get away with scanning files for links and then diffing against your folder structure, but this only identifies simple orphans, not clusters of orphaned files that reference each other (and therefore look like they are in use). So, using grep probably won't get you all the way there.
This isn't a trivial solution, but it would make an excellent utility for keeping your environment clean (and is therefore worth the effort). Plus, you can reuse it across all environments (and share it with others!)
The basic idea is to set up and populate a directed graph where each node's key is an absolute path. This is done by scanning all the files and adding their dependencies as edges - for example:
/index.html -> /subfolder/file.jpg
-> /subfolder/temp.html
-> /error.html
/temp.html -> /index.html
/error.html
/stray.html -> /index.html
/abandoned.html
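Here is a minimal Python sketch of the scanning step. The SITE_ROOT path, the file-extension filter, and the regex for href/src attributes are all assumptions for illustration; a real scanner would use an HTML parser and also pick up CSS url(...) references:

import os
import re
from collections import defaultdict

SITE_ROOT = "/var/www/site"  # hypothetical document root - adjust to your site

# Naive matcher for href="..." / src="..." attributes; good enough for a sketch.
LINK_RE = re.compile(r'(?:href|src)\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

def build_graph(root):
    """Scan every file under root; map each file to the files it references."""
    graph = defaultdict(set)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            graph.setdefault(path, set())  # isolated files still get a node
            if not name.lower().endswith((".html", ".htm", ".css", ".js")):
                continue  # only these file types are scanned for links
            with open(path, encoding="utf-8", errors="ignore") as f:
                text = f.read()
            for target in LINK_RE.findall(text):
                if target.startswith(("http:", "https:", "mailto:", "#")):
                    continue  # skip external and in-page links
                if target.startswith("/"):  # site-root-relative link
                    resolved = os.path.normpath(os.path.join(root, target.lstrip("/")))
                else:  # document-relative link
                    resolved = os.path.normpath(os.path.join(dirpath, target))
                graph[path].add(resolved)
    return graph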
Then, you can identify all your "reachable" files by doing a BFS from your root page.
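Continuing the sketch above, the BFS might look like this. In practice you would seed it with every legitimate entry point (the home page, error pages, landing pages linked only from external campaigns), not just a single index.html:

from collections import deque

def reachable_from(graph, start):
    """Breadth-first search: everything transitively linked from start."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):  # .get avoids mutating a defaultdict
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

graph = build_graph(SITE_ROOT)
orphan_candidates = set(graph) - reachable_from(graph, SITE_ROOT + "/index.html")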
With the directed graph, you can also classify files by their in-degree and out-degree. In the example above:
/index.html in: 1 out: 2
/temp.html in: 1 out: 1
/error.html in: 1 out: 0
/stray.html in: 0 out: 1
/abandoned.html in: 0 out: 0
So, you're basically looking for files that have in = 0; those are your abandoned candidates.
Additionally, files that have out = 0 are going to be terminal pages, which may or may not be desirable on your site (as the name suggests, /error.html is an error page).
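A small helper on top of the same graph structure can produce this classification; again, this is just a sketch under the assumptions above:

def degrees(graph):
    """Return {path: (in_degree, out_degree)} for every node in the graph."""
    in_deg = {node: 0 for node in graph}
    for targets in graph.values():
        for target in targets:  # a target may not exist on disk (broken link)
            in_deg[target] = in_deg.get(target, 0) + 1
    all_nodes = set(graph) | set(in_deg)
    return {node: (in_deg.get(node, 0), len(graph.get(node, ())))
            for node in all_nodes}

for path, (i, o) in sorted(degrees(graph).items()):
    flag = "ORPHAN?" if i == 0 else ("terminal" if o == 0 else "")
    print(f"{path}  in: {i}  out: {o}  {flag}")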