I want to compare two documents irrespective of line breaks. If the content is the same but the position and quantity of line breaks are different, I want to map the lines in one document to the lines in the other.
Given:
Document 1
I went to Paris in July 15, where I met some nice people.
And I came back
to NY in Aug 15.
I am planning
to go there soon
after I finish what I do.
Document 2
I went
to Paris
in July 15,
where I met
some nice people.
And I came back to NY in Aug 15.
I am planning to go
there soon after I finish what I do.
I want an algorithm capable of determining that line 1 in Document 1 contains the same text as lines 1 through 5 in Document 2, that lines 2 and 3 in Document 1 contain the same text as line 6 in Document 2, etc.
1 = 1,2,3,4,5
2,3 = 6
4,5,6 = 7,8
Is there a way with regex to match each line in each document if it spans over multiple lines in the other documents?
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import org.apache.commons.io.FileUtils;
public class Compare {
public static void main(String[] args) throws IOException {
String doc1 = FileUtils.readFileToString(new File("Doc1.txt"));
String doc2 = FileUtils.readFileToString(new File("Doc2.txt"));
String[] array1 = doc1.split("\n");
String[] array2 = doc2.split("\n");
int[] count1 = new int[array1.length];
int[] count2 = new int[array2.length];
int sum1 = 0;
int sum2 = 0;
for (int i=0;i<count1.length;i++) {
count1[i] = sum1 + array1[i].split(" ").length;
sum1 = count1[i];
}
for (int i=0;i<count2.length;i++) {
count2[i] = sum2 + array2[i].split(" ").length;
sum2 = count2[i];
}
ArrayList<Integer> result1 = new ArrayList<Integer>();
ArrayList<Integer> result2 = new ArrayList<Integer>();
for (int j=0; j<count1.length; ) {
for (int k=0; k<count2.length; ) {
if (count1[j]==count2[k]) {
result1.add(j+1);
result2.add(k+1);
System.out.println(result1.toString()+" = "+result2.toString());
result1 = new ArrayList<Integer>();
result2 = new ArrayList<Integer>();
j++;k++;
} else if (count1[j]>count2[k]) {
result2.add(k+1);
k++;
} else {
result1.add(j+1);
j++;
}
}
}
}
}
Sample output:
[1] = [1, 2, 3, 4, 5]
[2, 3] = [6]
[4, 5, 6] = [7, 8]
Complete and working Java code. It's not a regex solution, so it might not fit your need.
The idea is that we create an array for each document. The size of the array equals the number of lines in each document. The nth element of the array stores the number of words seen up to the nth line of the document. Then we identify those equal elements in both arrays, whose indices define the ranges of the output.
I am not a python programmer, but this does not look like a problem that can be solved with regex.
Instead, you'd first want to compare the documents to make sure the content is the same (temporarily remove all newlines beforehand). I don't know what you want done if it isn't, so I'm not going to address that.
Create a collection of integer collections called linemappings
Begin a loop. The loop will step through each character in each document simultaneously. You'll need four counter variables. charindex1
will contain the current character index in Document 1 and charindex2
will contain the current charater index in Document 2. lineindex1
will contain the current line index in Document 1 and lineindex2
will contain the current line index in Document 2.
Start with the char index variables to 0 and the line index variables initialized to 1.
Start Loop:
Get the current character from each document:
char1
from document 1 andchar2
from document 2.If
char1
ANDchar2
are BOTH newlines or NEITHER are newlines, then advance bothcharindex1
andcharindex2
by 1.
Else Ifchar1
is a newline, then advancecharindex1
by 1.
Else Ifchar2
is a newline, then advancecharindex2
by 1.If EITHER
char1
orchar2
is a newline, then insert a new record into thelinemappings
collection (the result at the end will be something like[[1,1],[1,2],[1,3],[1,4],[1,5],[2,6],[3,6],[4,7],[5,7],[6,7],[6,8]
)If
char1
is a newline, advancelineindex1
by 1.
Ifchar2
is a newline, advancelineindex2
by 1.Loop until the end of input is reached.
(I couldn't really test this since I'm not a python programmer, but hopefully you get the gist and can modify it to fit your needs.)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With