
How do document diff algorithms work?

Tags:

algorithm

diff

I want to implement a Word document differ. What algorithms does that require?

user183173 asked Oct 02 '09 15:10




2 Answers

Generally speaking, diffing is done by solving the longest common subsequence problem. Also see the "Algorithm" section of the Wikipedia article on Diff:

The operation of diff is based on solving the longest common subsequence problem.

In this problem, you have two sequences of items:

   a b c d f g h j q z
   a b c d e f g i j k r x y z

and you want to find the longest sequence of items that is present in both original sequences in the same order. That is, you want to find a new sequence which can be obtained from the first sequence by deleting some items, and from the second sequence by deleting other items. You also want this sequence to be as long as possible. In this case it is

   a b c d f g j z 
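The longest common subsequence described above can be computed with a small dynamic-programming table. The following is a minimal sketch rather than what any particular diff tool does; the function name `lcs` and the variable names are chosen here for illustration.

```python
def lcs(a, b):
    """Return one longest common subsequence of sequences a and b."""
    m, n = len(a), len(b)
    # table[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk forward through the table, collecting the matching items.
    result = []
    i = j = 0
    while i < m and j < n:
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1  # skipping a[i] loses nothing
        else:
            j += 1  # skipping b[j] loses nothing
    return result


old = "a b c d f g h j q z".split()
new = "a b c d e f g i j k r x y z".split()
print(" ".join(lcs(old, new)))  # prints: a b c d f g j z
```

This version uses O(m·n) time and memory, which is fine for short sequences but may be too slow for large documents compared line by line or word by word.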

From the longest common subsequence it's only a small step to get diff-like output:

   e   h i   q   k r x y
   +   - +   -   + + + +
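That small step can be sketched as follows: build the same kind of LCS table, then walk it and emit every item that is not part of the common subsequence, marked `-` if it appears only in the first sequence and `+` if it appears only in the second. This is an illustrative sketch; the name `diff_ops` and the tuple format are assumptions made here, not the output format of the real `diff` program.

```python
def diff_ops(a, b):
    """Return a list of (tag, item) pairs.

    tag is ' ' for items common to both, '-' for items only in a,
    and '+' for items only in b.
    """
    m, n = len(a), len(b)
    # table[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    ops = []
    i = j = 0
    while i < m and j < n:
        if a[i] == b[j]:
            ops.append((" ", a[i]))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            ops.append(("-", a[i]))  # item was deleted from a
            i += 1
        else:
            ops.append(("+", b[j]))  # item was inserted in b
            j += 1
    # Whatever remains on either side is a pure deletion or insertion.
    ops.extend(("-", item) for item in a[i:])
    ops.extend(("+", item) for item in b[j:])
    return ops


old = "a b c d f g h j q z".split()
new = "a b c d e f g i j k r x y z".split()
changes = [(tag, item) for tag, item in diff_ops(old, new) if tag != " "]
print(" ".join(item for _, item in changes))  # prints: e h i q k r x y
print(" ".join(tag for tag, _ in changes))    # prints: + - + - + + + +
```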

That said, this all works fine with text-based documents. Word documents are stored either in a binary format (.doc) or as zipped XML (.docx), and they carry a lot of formatting information and other data, so diffing them is far more complex. Ideally, you could look into automating Word itself, as it has the ability to "diff" two documents, as detailed here:

Microsoft Word Tip: How to compare two documents for differences
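If automating Word is not an option, one alternative (not what the tip above describes) is to pull the plain text out of the file yourself and feed the resulting list to an LCS-based diff. The sketch below assumes the newer .docx format, which is a ZIP archive whose main body lives in `word/document.xml`; it does not handle legacy binary .doc files, and it ignores formatting, headers, footnotes, and tracked changes. The function name `docx_paragraphs` is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path_or_file):
    """Return the plain text of each paragraph in a .docx file, in order."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> runs.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        paragraphs.append(text)
    return paragraphs
```

Diffing paragraph lists this way only reports textual changes; a change in formatting alone would go unnoticed.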

CraigTP answered Sep 18 '22 20:09


A diff is essentially just a solution to the longest common subsequence problem.

The optimal solution requires knowledge of dynamic programming, so it's a fairly complex problem to solve.

However, the closely related longest common substring problem (where the match must be contiguous) can also be solved by constructing a suffix tree. Both algorithms are outlined here.

Ben S answered Sep 18 '22 20:09