
Shortest sequence of operations transforming a file tree to another

Given two file trees A and B, is it possible to determine the shortest sequence of operations (or at least a reasonably short one) that transforms A into B?

An operation can be:

  1. Create a new, empty folder
  2. Create a new file with any contents
  3. Delete a file
  4. Delete an empty folder
  5. Rename a file
  6. Rename a folder
  7. Move a file inside another existing folder
  8. Move a folder inside another existing folder

A and B are identical when they have the same files with the same names and the same contents (or the same size and the same CRC), in the same folder structure.
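To make that definition concrete, here is a minimal Python sketch of the identity test. The helper names `snapshot` and `trees_identical` are made up for illustration, and it assumes that matching size plus CRC-32 is an acceptable stand-in for comparing contents byte by byte.

```python
import zlib
from pathlib import Path


def snapshot(root):
    """Map each path relative to root to None (folder) or (size, crc32) (file)."""
    root = Path(root)
    entries = {}
    for path in root.rglob("*"):
        rel = path.relative_to(root).as_posix()
        if path.is_dir():
            entries[rel] = None
        else:
            # Reads the whole file at once; fine for a sketch, but large
            # files would be better hashed in chunks.
            data = path.read_bytes()
            entries[rel] = (len(data), zlib.crc32(data))
    return entries


def trees_identical(root_a, root_b):
    """True when both trees hold the same folders and the same files."""
    return snapshot(root_a) == snapshot(root_b)
```

Because empty folders are recorded too, two trees that differ only by an empty folder compare as different, which matches the operation list above (creating or deleting an empty folder counts as an operation).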

This question has been puzzling me for some time. For the moment I have the following, basic idea:

  • Compute a database:
    • Store file names and their CRCs
    • Then, find all folders with no subfolders, and compute a CRC from the CRCs of the files they contain, and a size from the total size of the files they contain
    • Ascend the tree to make a CRC for each parent folder
  • Use the following loop having database A and database B:
    • Compute A ∩ B and remove this intersection from both databases.
    • Use an inner join to find matching CRCs in A and B, folders first, order by size desc
    • While there is a result, use the first result to make a folder or file move (creating new folders if necessary), and remove the source rows of that result from both databases. If there was a move, update the CRCs of the new location's parent folders in database A.
    • Then remove all files and folders referenced in database A and create those referenced in database B.
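A rough Python sketch of this idea follows, under several assumptions that are mine rather than part of the plan above: the trees are modelled in memory as nested dicts (a folder is a dict of name to child, a file is `bytes`), SHA-1 digests replace the CRCs, a folder's digest also covers its children's names, "size" is the number of entries in a subtree rather than a byte count, and a folder that exists at the same path on both sides is left in place. The names `index_tree`, `inside` and `plan` are hypothetical. The digests are not updated after a move, and the result is three groups of operations, not a fully sequenced script (name collisions are not handled).

```python
import hashlib


def index_tree(tree):
    """Map each path to (is_folder, digest, size), computed bottom-up."""
    table = {}

    def visit(node, path):
        is_folder = isinstance(node, dict)
        size = 1  # number of entries in this subtree, itself included
        if is_folder:
            h = hashlib.sha1(b"folder")
            for name in sorted(node):
                child = f"{path}/{name}" if path else name
                child_digest, child_size = visit(node[name], child)
                h.update(name.encode() + b"\0" + child_digest)
                size += child_size
            digest = h.digest()
        else:
            digest = hashlib.sha1(b"file" + node).digest()
        if path:  # the root itself is never moved
            table[path] = (is_folder, digest, size)
        return digest, size

    visit(tree, "")
    return table


def inside(path, ancestor):
    return path == ancestor or path.startswith(ancestor + "/")


def plan(tree_a, tree_b):
    """Return (moves, deletes, creates) that turn tree_a into tree_b."""
    a, b = index_tree(tree_a), index_tree(tree_b)
    # Drop what is already in place: identical entries, and folders that
    # exist at the same path on both sides (only their contents differ).
    settled = {p for p in a
               if p in b and (a[p] == b[p] or (a[p][0] and b[p][0]))}
    a = {p: v for p, v in a.items() if p not in settled}
    b = {p: v for p, v in b.items() if p not in settled}

    # Pair equal digests, largest subtree first, so a whole folder is
    # moved before its children are considered one by one.
    moves = []
    order = sorted(a, key=lambda p: (-a[p][2], p))
    for src in order:
        if src not in a:  # already covered by a parent's move
            continue
        dst = next((p for p in sorted(b) if b[p] == a[src]), None)
        if dst is None:
            continue
        moves.append((src, dst))
        a = {p: v for p, v in a.items() if not inside(p, src)}
        b = {p: v for p, v in b.items() if not inside(p, dst)}

    deletes = sorted(a, key=lambda p: (-p.count("/"), p))  # deepest first
    creates = sorted(b, key=lambda p: (p.count("/"), p))   # shallowest first
    return moves, deletes, creates
```

For example, with A = `{"docs": {"a.txt": b"alpha", "b.txt": b"beta"}, "old": {"pics": {"p.png": b"img"}}, "tmp.log": b"x"}` and B = `{"docs": {"a.txt": b"alpha", "notes.txt": b"beta"}, "pics": {"p.png": b"img"}, "new.txt": b"n"}`, the sketch proposes moving `old/pics` to `pics` as a whole, renaming `docs/b.txt` to `docs/notes.txt`, deleting `old` and `tmp.log`, and creating `new.txt`. Like the original idea, it is greedy, so nothing guarantees that the sequence is the shortest one.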

However, I think this is a rather suboptimal way to do it. What advice could you give me?

Thank you!

Benoit asked Aug 01 '11 19:08



2 Answers

This problem is a special case of the tree edit distance problem. Since the entries of a folder have no inherent order, it is the unordered variant, for which finding an optimal solution is (unfortunately) known to be NP-hard. This means that there probably isn't any fast algorithm that always finds the shortest sequence in the general case.

That said, the paper I linked does contain several nice discussions of approximation algorithms and algorithms that work in restricted cases of the problem. You may find the discussion interesting, as it illuminates many of the issues that actually arise in solving this problem.

Hope this helps! And thanks for posting an awesome question!

templatetypedef answered Sep 19 '22 13:09



You might want to check out tree edit distance algorithms. I don't know if they will map neatly onto your file system, but they might give you some ideas.

https://github.com/irskep/sleepytree (code and paper)

dfb answered Sep 21 '22 13:09
