Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to create a diff of two complex data structures?

Problem specification:

I am currently searching for a elegant and/but efficient solution to a problem that i guess is quite common. Consider the following situation:

I defined a fileformat based on a BTree that is defined (in a simplified way) like this:

data FileTree = FileNode [Key] [FileOffset]
              | FileLeaf [Key] [Data]

Reading and writing this from a file to a lazy data structure is implemented and works just fine. This will result in a instance of:

data MemTree = MemNode [Key] [MemTree]
             | MemLeaf [Key] [Data]

Now my goal is to have a generic function updateFile :: FilePath -> (MemTree -> MemTree) -> IO () that will read in the FileTree and convert it into a MemTree, apply the MemTree -> MemTree function and write back the changes to the tree structure. The problem is that the FileOffsets have to be conserved somehow.

I have two approaches to this problem. Both of them lack in elegance and/or efficiency:

Approach 1: Extend MemTree to contain the offsets

This approach extends the MemTree to contain the offsets:

data MemTree = MemNode [Key] [(MemTree, Maybe FileOffset)]
             | MemNode [Key] [Data]

The read function would then read in the FileTree and stores the FileOffset alongside the MemTree reference. Writing will checks if a reference already has an associated offset and if it does it just uses it.

Pros: easy to implement, no overhead to find the offset

Cons: exposes internal to the user who is responsible to set the offset to Nothing

Approach 2: Store offsets in a secondary structure

Another way to attack this problem is to read in the FileTree and create a StableName.Map that holds onto the FileOffsets. That way (and if i understand the semantics of StableName correctly) it should be possible to take the final MemTree and lookup the StableName of each node in the the StableName.Map. If there is an entry the node is clean and doesn't have to be written again.

Pros: doesn't expose the internals to the user

Cons: involves overhead for lookups in the map

Conclusion

These are the two approaches i can think of. The first one should be more efficient, the second one is more pleasant to the eye. I'd like your comments on my ideas, maybe someone even has a better approach in mind?

[Edit] Reasonal

There are two reasons i am searching for a solution like this:

On the one hand you should try to handle errors before they arise by using the type system. The aforementioned user is of course the designer of the next layer in the system (ie me). By working on the pure tree representation some kinds of bugs won't be able to happen. All changes to the tree in the file should be in one place. That should make reasoning easier.

On the other hand i could just implement something like insert :: FilePath -> Key -> Value -> IO () and be done with it. But then i'll lose a very nice trait that comes free when i keep a (kind of a) log by updating the tree in place. Transactions (ie merging of several inserts) are just a matter of working on the same tree in memory and writing just the differences back to the file.

like image 952
fho Avatar asked Oct 13 '12 14:10

fho


People also ask

What are the 2 main types of data structures?

Basically, data structures are divided into two categories: Linear data structure. Non-linear data structure.

What are complex data structures?

Complex types are nested data structures composed of primitive data types. These data structures can also be composed of other complex types. Some examples of complex types include struct(row), array/list, map and union. Complex types are supported by most programming languages including Python, C++ and Java.

What are the different types of data structures explain briefly?

Data Structure can be defined as the group of data elements which provides an efficient way of storing and organising data in the computer so that it can be used efficiently. Some examples of Data Structures are arrays, Linked List, Stack, Queue, etc.


1 Answers

I think that the package Data.Generic.Diff may do exactly what you wanted. It references somebody's thesis for the idea of how it works.

like image 133
user21952-is-a-great-name Avatar answered Oct 19 '22 19:10

user21952-is-a-great-name