How to create a diff of two complex data structures?

Problem specification:

I am currently searching for a elegant and/but efficient solution to a problem that i guess is quite common. Consider the following situation:

I defined a fileformat based on a BTree that is defined (in a simplified way) like this:

data FileTree = FileNode [Key] [FileOffset]
              | FileLeaf [Key] [Data]

Reading and writing this from a file to a lazy data structure is implemented and works just fine. This will result in a instance of:

data MemTree = MemNode [Key] [MemTree]
             | MemLeaf [Key] [Data]

Now my goal is to have a generic function updateFile :: FilePath -> (MemTree -> MemTree) -> IO () that will read in the FileTree and convert it into a MemTree, apply the MemTree -> MemTree function and write back the changes to the tree structure. The problem is that the FileOffsets have to be conserved somehow.

I have two approaches to this problem. Both of them lack in elegance and/or efficiency:

Approach 1: Extend MemTree to contain the offsets

This approach extends the MemTree to contain the offsets:

data MemTree = MemNode [Key] [(MemTree, Maybe FileOffset)]
             | MemNode [Key] [Data]

The read function would then read in the FileTree and stores the FileOffset alongside the MemTree reference. Writing will checks if a reference already has an associated offset and if it does it just uses it.

Pros: easy to implement, no overhead to find the offset

Cons: exposes internal to the user who is responsible to set the offset to Nothing

Approach 2: Store offsets in a secondary structure

Another way to attack this problem is to read in the FileTree and create a StableName.Map that holds onto the FileOffsets. That way (and if i understand the semantics of StableName correctly) it should be possible to take the final MemTree and lookup the StableName of each node in the the StableName.Map. If there is an entry the node is clean and doesn't have to be written again.

Pros: doesn't expose the internals to the user

Cons: involves overhead for lookups in the map

Conclusion

These are the two approaches i can think of. The first one should be more efficient, the second one is more pleasant to the eye. I'd like your comments on my ideas, maybe someone even has a better approach in mind?

[Edit] Reasonal

There are two reasons i am searching for a solution like this:

On the one hand you should try to handle errors before they arise by using the type system. The aforementioned user is of course the designer of the next layer in the system (ie me). By working on the pure tree representation some kinds of bugs won't be able to happen. All changes to the tree in the file should be in one place. That should make reasoning easier.

On the other hand i could just implement something like insert :: FilePath -> Key -> Value -> IO () and be done with it. But then i'll lose a very nice trait that comes free when i keep a (kind of a) log by updating the tree in place. Transactions (ie merging of several inserts) are just a matter of working on the same tree in memory and writing just the differences back to the file.

952

asked Oct 13 '12 14:10

fho

1 Answers

I think that the package Data.Generic.Diff may do exactly what you wanted. It references somebody's thesis for the idea of how it works.

133

answered Oct 19 '22 19:10

user21952-is-a-great-name

Related questions
                            
                                Is it bad form to make new types/datas for clarity? [closed]
                            
                                What's the practical value of all those newtype wrappers in `Data.Monoid`?
                            
                                IO and Maybe monad interaction
                            
                                Why can't Haskell be tricked into performing IO operations by using strict evaluation?
                            
                                How to model class hierarchies in Haskell?
                            
                                Does an IO monad make sense in a language like C#
                            
                                Determining if a given number is a prime in haskell
                            
                                Why can applicative functors have side effects, but functors can't?
                            
                                Idiomatic way to sum a list of Maybe Int in haskell
                            
                                Point-free in Haskell
                            
                                Haskell replace characters in string
                            
                                Ways to get the middle of a list in Haskell?
                            
                                Why is the type system refusing my seemingly valid program?
                            
                                Continue debugging in GHC after interrupt
                            
                                What is the precedence of ~, and why?
                            
                                Distributing Haskell on a cluster
                            
                                How do I specialise with an explicit type application?
                            
                                Strictness optimization and memory allocation in Haskell
                            
                                Generate optparse-applicative parser from free alternative functor
                            
                                Disable printing of IO results in GHCi?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How to create a diff of two complex data structures?

Tags:

data-structures

haskell