Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Comparing two documents using regex

I want to compare two documents irrespective of line breaks. If the content is the same but the position and quantity of line breaks are different, I want to map the lines in one document to the lines in the other.

Given:

Document 1

I went to Paris in July 15, where I met some nice people.
And I came back
to NY in Aug 15.
I am planning
to go there soon
after I finish what I do.

Document 2

I went
to Paris
in July 15,
where I met
some nice people.
And I came back to NY in Aug 15.
I am planning to go
there soon after I finish what I do.

I want an algorithm capable of determining that line 1 in Document 1 contains the same text as lines 1 through 5 in Document 2, that lines 2 and 3 in Document 1 contain the same text as line 6 in Document 2, etc.

1 = 1,2,3,4,5
2,3 = 6
4,5,6 = 7,8

Is there a way with regex to match each line in each document if it spans over multiple lines in the other documents?

like image 928
hmghaly Avatar asked Feb 01 '13 18:02

hmghaly


2 Answers

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import org.apache.commons.io.FileUtils;

public class Compare {
    public static void main(String[] args) throws IOException {
        String doc1 = FileUtils.readFileToString(new File("Doc1.txt"));
        String doc2 = FileUtils.readFileToString(new File("Doc2.txt"));
        String[] array1 = doc1.split("\n");
        String[] array2 = doc2.split("\n");
        int[] count1 = new int[array1.length];
        int[] count2 = new int[array2.length];
        int sum1 = 0;
        int sum2 = 0;
        for (int i=0;i<count1.length;i++) {
            count1[i] = sum1 + array1[i].split(" ").length;
            sum1 = count1[i];
        }
        for (int i=0;i<count2.length;i++) {
            count2[i] = sum2 + array2[i].split(" ").length;
            sum2 = count2[i];
        }
        ArrayList<Integer> result1 = new ArrayList<Integer>();
        ArrayList<Integer> result2 = new ArrayList<Integer>();
        for (int j=0; j<count1.length; ) {
            for (int k=0; k<count2.length; ) {
                if (count1[j]==count2[k]) {
                    result1.add(j+1);
                    result2.add(k+1);
                    System.out.println(result1.toString()+" = "+result2.toString());
                    result1 = new ArrayList<Integer>();
                    result2 = new ArrayList<Integer>();
                    j++;k++;
                } else if (count1[j]>count2[k]) {
                    result2.add(k+1);
                    k++;
                } else {
                    result1.add(j+1);
                    j++;
                }
            }
        }
    }
}

Sample output:

[1] = [1, 2, 3, 4, 5]
[2, 3] = [6]
[4, 5, 6] = [7, 8]

Complete and working Java code. It's not a regex solution, so it might not fit your need.

The idea is that we create an array for each document. The size of the array equals the number of lines in each document. The nth element of the array stores the number of words seen up to the nth line of the document. Then we identify those equal elements in both arrays, whose indices define the ranges of the output.

like image 182
Terry Li Avatar answered Nov 06 '22 04:11

Terry Li


I am not a python programmer, but this does not look like a problem that can be solved with regex.

Instead, you'd first want to compare the documents to make sure the content is the same (temporarily remove all newlines beforehand). I don't know what you want done if it isn't, so I'm not going to address that.

Create a collection of integer collections called linemappings

Begin a loop. The loop will step through each character in each document simultaneously. You'll need four counter variables. charindex1 will contain the current character index in Document 1 and charindex2 will contain the current charater index in Document 2. lineindex1 will contain the current line index in Document 1 and lineindex2 will contain the current line index in Document 2.

Start with the char index variables to 0 and the line index variables initialized to 1.

Start Loop:

Get the current character from each document: char1 from document 1 and char2 from document 2.

If char1 AND char2 are BOTH newlines or NEITHER are newlines, then advance both charindex1 and charindex2 by 1.
Else If char1 is a newline, then advance charindex1 by 1.
Else If char2 is a newline, then advance charindex2 by 1.

If EITHER char1 or char2 is a newline, then insert a new record into the linemappings collection (the result at the end will be something like [[1,1],[1,2],[1,3],[1,4],[1,5],[2,6],[3,6],[4,7],[5,7],[6,7],[6,8])

If char1 is a newline, advance lineindex1 by 1.
If char2 is a newline, advance lineindex2 by 1.

Loop until the end of input is reached.

(I couldn't really test this since I'm not a python programmer, but hopefully you get the gist and can modify it to fit your needs.)

like image 26
JDB Avatar answered Nov 06 '22 02:11

JDB