
Best generic strategy to group items using multiple criteria

I have a simple, real-life problem I want to solve using an OO approach. My hard drive is a mess. I have 1,500,000 files, duplicates, complete duplicate folders, and so on...

The first step, of course, is parsing all the files into my database. No problems so far; now I have a lot of nice entries which are kind of "naturally grouped". Examples of this simple grouping can be obtained using simple queries like:

  1. Give me all files bigger than 100MB
  2. Show all files older than 3 days
  3. Get me all files ending with docx
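Assuming the parsing step lands everything in a table like `files(path, size, mtime, ext)` (the schema is my assumption), the three groupings above are plain `WHERE` clauses — a minimal sketch with SQLite:

```python
import sqlite3

# Assumed schema: one row per file, populated by the parsing step.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (path TEXT, size INTEGER, mtime REAL, ext TEXT)")
conn.executemany(
    "INSERT INTO files VALUES (?, ?, ?, ?)",
    [
        ("/home/me/video.mkv", 200_000_000, 1_600_000_000.0, "mkv"),
        ("/home/me/report.docx", 40_000, 1_700_000_000.0, "docx"),
    ],
)

# 1. All files bigger than 100 MB
big = conn.execute("SELECT path FROM files WHERE size > 100 * 1024 * 1024").fetchall()

# 2. All files older than 3 days (mtime is a Unix timestamp)
old = conn.execute(
    "SELECT path FROM files WHERE mtime < strftime('%s','now') - 3*86400"
).fetchall()

# 3. All files ending with docx
docs = conn.execute("SELECT path FROM files WHERE ext = 'docx'").fetchall()
```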

But now assume I want to find groups with a little more natural meaning. There are different strategies for this, depending on the "use case".

Assume I have a bad habit of putting all my downloaded files on the desktop first. Then I extract them to the appropriate folder, often without deleting the ZIP file. Then I move them into an "attic" folder. For the system to find this group of files, a time-oriented search approach, perhaps combined with a "check whether the ZIP content is the same as folder X", would be suitable.

Assume another bad habit of duplicating files, having one folder where "the clean files" are located in a nice structure, and other, messy folders. Now my clean folder has 20 picture galleries; my messy folder has 5 duplicated and 1 new gallery. A human user could easily identify this logic by seeing "Oh, that's all just duplicates, that's a new one, so I put the new one in the clean folder and trash all the duplicates".
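The duplicate-gallery case can be detected mechanically with content hashing: a folder in the messy tree whose set of file hashes is a subset of the clean tree's hashes contains nothing new. A sketch (function names are mine, not from any particular tool):

```python
import hashlib
from pathlib import Path

def file_hash(path: Path) -> str:
    """SHA-256 of the file contents, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

def folder_signature(folder: Path) -> frozenset:
    """The set of content hashes of every file below the folder."""
    return frozenset(file_hash(p) for p in folder.rglob("*") if p.is_file())

def duplicate_folders(clean: Path, messy: Path) -> list:
    """Subfolders of `messy` whose entire content already exists under `clean`."""
    clean_sig = folder_signature(clean)
    return [
        d for d in messy.iterdir()
        if d.is_dir() and folder_signature(d) <= clean_sig
    ]
```

Note this compares contents, not names, so renamed copies are still caught; an empty folder would also be flagged as "all duplicates", which may or may not be what you want.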

So, now to get to the point:

Which combination of strategies or patterns would you use to tackle such a situation? If I chain filters, the "hardest" one wins, and I have no idea how to let the system "test" for suitable combinations. It seems to me it is more than just filtering: it's dynamic grouping by combining multiple criteria to find the "best" groups.
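One way around the "hardest filter wins" problem is to let each grouping strategy *propose* candidate groups with a confidence weight, and keep, per file, the proposal the system trusts most. A rough sketch — the two strategies and their weights are purely illustrative assumptions:

```python
# Each "strategy" proposes candidate groups with a confidence weight;
# instead of one filter overruling the rest, the most confident proposal wins.

def by_extension(files):
    """Weak grouping: same file extension."""
    groups = {}
    for f in files:
        groups.setdefault(f["ext"], set()).add(f["path"])
    return [(g, 0.3) for g in groups.values()]

def by_directory(files):
    """Stronger grouping: same parent directory."""
    groups = {}
    for f in files:
        groups.setdefault(f["dir"], set()).add(f["path"])
    return [(g, 0.5) for g in groups.values()]

def best_groups(files, strategies):
    """For every file, keep the highest-confidence group any strategy proposed."""
    best = {}
    for strategy in strategies:
        for group, conf in strategy(files):
            for path in group:
                if path not in best or conf > best[path][1]:
                    best[path] = (frozenset(group), conf)
    return best
```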

One very rough approach would be this:

  1. In the beginning, all files are equal
  2. The first, not so "good" group is the directory
  3. If you are a big, clean directory, you earn points (evenly distributed names)
  4. If all files have the same creation date, you may be "autocreated"
  5. If you are a child of Program-Files, I don't care for you at all
  6. If I move you, group A, into group C, would this improve the "entropy"?
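The rough approach above can be sketched as a heuristic per-directory scorer. The concrete rules and weights below are guesses that mirror the list, not a worked-out design:

```python
from dataclasses import dataclass
from statistics import pstdev

@dataclass
class FileInfo:
    name: str
    ctime: float  # creation timestamp

def directory_score(path: str, files: list) -> float:
    """Heuristic 'how good a group is this directory?' score."""
    if "Program Files" in path:          # rule 5: system dirs score nothing
        return 0.0
    score = 1.0                          # rule 2: the directory itself is a weak group
    # rule 3: evenly distributed name lengths hint at a clean, curated folder
    lengths = [len(f.name) for f in files]
    if len(files) > 1 and pstdev(lengths) < 2.0:
        score += 2.0
    # rule 4: identical creation dates hint at an auto-created (e.g. extracted) folder
    if len({round(f.ctime) for f in files}) == 1:
        score += 1.0
    return score
```

Rule 6 would then become: move group A into group C only if the sum of scores afterwards is higher than before.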

What are the best patterns fitting this situation? Strategy, Pipes and Filters, "Grouping"... Any comments welcome!

Edit in reaction to answers:

The tagging approach: Of course, tagging crossed my mind. But where do I draw the line? I could create different tag types, like InDirTag, CreatedOnDayXTag, TopicZTag, AuthorPTag. These tags could be structured in a hierarchy, but the question of how to group would remain. I will give this some thought and add my insights here.
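One cheap way to get groups out of flat tags, sketched below: index files by every tag combination they carry, and keep only combinations shared by more than one file. The tag strings follow the names above; the sample data is invented:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical tag assignments, mirroring the tag types in the question.
tags = {
    "a.zip": {"InDir:Desktop", "CreatedOnDay:2008-10-05"},
    "b.docx": {"InDir:Desktop", "CreatedOnDay:2008-10-05", "Author:P"},
    "c.jpg": {"InDir:Attic"},
}

def groups_by_shared_tags(tags):
    """Map every tag combination to the set of files carrying all of those tags."""
    groups = defaultdict(set)
    for f, ts in tags.items():
        for r in range(1, len(ts) + 1):
            for combo in combinations(sorted(ts), r):
                groups[combo].add(f)
    # keep only combinations shared by more than one file
    return {c: fs for c, fs in groups.items() if len(fs) > 1}
```

Combinations are exponential in the number of tags per file, so this only works while files carry a handful of tags each.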

The procrastination comment: Yes, it sounds like that. But the files are only the simplest example I could come up with (and the most relevant at the moment). It's actually part of the bigger picture of grouping related data in dynamic ways. Perhaps I should have kept it more abstract to stress this: I am NOT searching for a file tagging tool or a search engine, but for an algorithm or pattern to approach this problem... (or better: ideas, like tagging)

Chris

asked Oct 05 '08 12:10 by Christian


2 Answers

You're procrastinating. Stop that, and clean up your mess. If it's really big, I recommend the following tactic:

  1. Make a copy of all the stuff on your drive on an external disk (USB or whatever)
  2. Do a clean install of your system
  3. As soon as you find you need something, get it from your copy, and place it in a well defined location
  4. After 6 months, throw away your external drive. Anything that's on there can't be that important.

You can also install Google Desktop, which does not clean your mess, but at least lets you search it efficiently.

If you want to prevent this from happening in the future, you have to change the way you're organizing things on your computer.

Hope this helps.

answered Nov 08 '22 09:11 by Rolf


I don't have a solution (and would love to see one), but I might suggest extracting metadata from your files besides the obvious name, size and timestamps.

  • in-band metadata such as MP3 ID3 tags, version information for EXEs / DLLs, HTML title and keywords, Summary information for Office documents etc. Even image files can have interesting metadata. A hash of the entire contents helps if looking for duplicates.
  • out-of-band metadata such as can be stored in NTFS alternate data streams — e.g. what you can edit in the Summary tab for non-Office files
  • your browsers keep information on where you have downloaded files from (though Opera doesn't keep it for long), if you can read it.
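As an example of in-band metadata that needs no external library: a .docx file is a ZIP archive whose author is stored in the `docProps/core.xml` part, so the standard library alone can read it. A sketch:

```python
import zipfile
import xml.etree.ElementTree as ET
from typing import Optional

DC = "{http://purl.org/dc/elements/1.1/}"  # Dublin Core namespace used by OOXML

def docx_author(path: str) -> Optional[str]:
    """Read the author of a .docx file from its core-properties part."""
    with zipfile.ZipFile(path) as z:
        root = ET.fromstring(z.read("docProps/core.xml"))
    creator = root.find(f"{DC}creator")
    return creator.text if creator is not None else None
```

The same ZIP-plus-XML trick covers xlsx and pptx, since all OOXML formats share the core-properties part.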
answered Nov 08 '22 07:11 by Hugh Allen