
Paragraph Matching Python

Tags:

python

Background information

I have a Python script which generates Word documents with the docx module. These documents are generated from a log, then printed and stored as records. However, the log can be edited retroactively, so the document records need to be revised, and these revisions must be tracked. I'm not actually revising the documents; I generate a new one which shows the difference between what is currently in the log and what will soon be in the log (the log is updated after the revised file is printed). When a revision occurs, my script uses diff_match_patch to generate a mark-up of what has changed with the following function:

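import diff_match_patch as dmp_module  # import assumed; not shown in the original snippet

def revFinder(str1, str2):
    # Diff the two strings and tag each chunk with a docx style code.
    dmp = dmp_module.diff_match_patch()
    diffs = dmp.diff_main(str1, str2)

    paratext = []
    for op, text in diffs:
        # op: 0 = unchanged (plain), -1 = deleted ('s'), 1 = inserted ('b')
        paratext.append((text, '' if op == 0 else ('s' if op == -1 else 'b')))

    return paratext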

docx can take text either as strings, or by tuple if word-by-word formatting is required, so [see second bullet in "Some Things to Note"]

[("Hello, ", ''), ("my name ", 'b'), ("is Brad", 's')]

produces

Hello, my name is Brad   (with "my name " in bold and "is Brad" struck through)


The Problem

diff_match_patch is a very efficient library for finding the difference between two texts. Unfortunately, it's a little too efficient, so replacing redundant with dune results in

redunante

This is ugly, but it's fine for single words. However, if an entire paragraph gets replaced, the results will be entirely unreadable. That is not OK.

Previously I addressed this by collapsing all the text into a single paragraph, but this was less than ideal because it became very cluttered and was still pretty ugly.


The Solution So Far

I have a function which creates the revision document. This function gets passed a list of tuples set up like this:

[(fieldName, original, revised)]

So the document is set up as

Original fieldName (With Markup)
  result of revFinder diffing original and revised

Revised fieldName    
  revised

I assume that to resolve the problem I'll need to do some sort of matching between paragraphs, to make sure I don't diff two completely separate paragraphs. I'm also assuming this matching will depend on whether paragraphs are added or removed. Here's the code I have so far:

if len(item[1].split('\n')) + len(item[2].split('\n')) == 2:

    body.append(heading("Original {} (With Markup)".format(item[0]),2))
    body.append(paragraph(revFinder(item[1],item[2])))
    body.append(paragraph("",style="BodyTextKeep"))
    body.append(heading("Revised {}".format(item[0]),2))
    body.append(paragraph(item[2]))
    body.append(paragraph(""))

else:
    diff = len(item[1].split('\n')) - len(item[2].split('\n'))
    if diff == 0:       

        body.append(heading("Original {} (With Markup)".format(item[0]),2))
        for orPara, revPara in zip(item[1].split('\n'),item[2].split('\n')):
            body.append(paragraph(revFinder(orPara,revPara)))
        body.append(paragraph("",style="BodyTextKeep"))
        body.append(heading("Revised {}".format(item[0]),2))
        for para in item[2].split('\n'):
            body.append(paragraph("{}".format(para)))     
        body.append(paragraph(""))

    elif diff > 0:
        # Removed paragraphs
        pass  # not yet implemented

    elif diff < 0:
        # Added paragraphs
        pass  # not yet implemented

So far I've planned on using something like difflib to do paragraph matching. But if there's a better way to avoid this problem that is a completely different approach, that's great too.
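As a rough illustration of the difflib idea, paragraph matching could be sketched as follows. This is only a sketch under assumptions: the name match_paragraphs, the 0.5 similarity cutoff, and the greedy in-order pairing are hypothetical choices, not part of the script above or of docx. It uses only difflib.SequenceMatcher from the standard library.

```python
import difflib

def match_paragraphs(original, revised, cutoff=0.5):
    """Pair up the paragraphs of two texts (hypothetical helper).

    Returns a list of (old_para, new_para) tuples. old_para is None for an
    added paragraph; new_para is None for a removed one.
    """
    old = original.split('\n')
    new = revised.split('\n')
    pairs = []
    # First pass: align whole paragraphs, treating each one as a single item.
    matcher = difflib.SequenceMatcher(None, old, new)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == 'equal':
            pairs.extend(zip(old[i1:i2], new[j1:j2]))
        elif tag == 'delete':
            pairs.extend((p, None) for p in old[i1:i2])
        elif tag == 'insert':
            pairs.extend((None, q) for q in new[j1:j2])
        else:
            # 'replace': pair an old paragraph with the first remaining new
            # paragraph that is similar enough, and keep the rest separate.
            j = j1
            for p in old[i1:i2]:
                hit = None
                for k in range(j, j2):
                    ratio = difflib.SequenceMatcher(None, p, new[k]).ratio()
                    if ratio >= cutoff:
                        hit = k
                        break
                if hit is None:
                    pairs.append((p, None))          # removed paragraph
                else:
                    pairs.extend((None, q) for q in new[j:hit])  # added before the match
                    pairs.append((p, new[hit]))      # edited paragraph
                    j = hit + 1
            pairs.extend((None, q) for q in new[j:j2])  # leftover additions

    return pairs

if __name__ == '__main__':
    before = "The pump was leaking oil.\nValve B was closed.\nNo further action."
    after = "The pump was leaking oil badly.\nNo further action.\nInspection scheduled."
    for old_para, new_para in match_paragraphs(before, after):
        print((old_para, new_para))
```

Pairs with both sides present could then be passed to revFinder, while a pair with None on one side would be rendered as a wholly struck-through or wholly bold paragraph instead of being diffed against unrelated text.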


Some Things to Note:

  • I'm running Python 2.7.6 32-bit on Windows 7 64-bit
  • I've made some changes to my local copy of docx (namely adding the strike through formatting) so if you test this code you will not be able to replicate what I'm doing in that regard

Description of the Entire Process (with the revision steps in bold):

1) User opens Python script and uses GUI to add information to a thing called a "Condition Report" (CR)

NOTE: A full CR contains 4 parts, all completed by different people. But each part gets individually printed. All 4 parts are stored together in the log

2) When the user is finished, the information is saved to a log (described below), and then printed as a .docx file

3) The printed document is signed and stored

4) When the user wants to revise a part of the CR, they open the GUI and edit the information in each of the fields. I am only concerned about a few of the fields in this question, and those are the multiline text controls (which can result in multiple paragraphs)

5) Once the user is done with the revision, the code generates the tuple list I described in the "Solution So Far" section, and sends this to the function which generates the revision document

6) The revision document is created, printed, signed, and stored with the original document for that part of that CR

7) The log is completely rewritten to include the revised information


The Log:

The log is simply a giant dict which stores all the information on all of the CRs. The general format is

{"Unique ID Number": [list of CR info]}

The log doesn't store past versions of a CR, so when a CR is revised the old information is overwritten (which is what we want for the system). As I mentioned earlier, every time the log is edited, the whole thing is rewritten. To get at the information in the log, I import it (since it always lives in the same directory as the script)

wnnmaw asked Jan 30 '14 20:01


1 Answer

Try using the post-diff cleanup options of diff_match_patch that @tzaman mentioned above. In particular, check out the diff_cleanupSemantic function, which is intended for cases where the diff output needs to be human-readable.

Cleanup options are NOT run automatically, since diff_match_patch provides several cleanup options from which you may choose (depending on your needs).

Here is an example:

import diff_match_patch

dmp = diff_match_patch.diff_match_patch()
diffs = dmp.diff_main('This is my original paragraph.', 'My paragraph is much better now.')
print diffs  # pre-cleanup

dmp.diff_cleanupSemantic(diffs)
print diffs  # post cleanup

Output:

[(-1, 'This is m'), (1, 'M'), (0, 'y'), (-1, ' original'), (0, ' paragraph'), (1, ' is much better now'), (0, '.')]
[(-1, 'This is my original paragraph'), (1, 'My paragraph is much better now'), (0, '.')]

As you can see, the first diff is optimal but unreadable, while the second diff (after cleanup) is exactly what you are looking for.
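To feed the cleaned-up diff into docx, the (op, text) tuples still need to be turned into the (text, style) tuples that the question's revFinder returns. A minimal sketch of that step, assuming the question's style codes ('' plain, 's' strike-through, 'b' bold); the names STYLES and diffs_to_paratext are made up for illustration, and the diff list is pasted in from the output above so the snippet runs without diff_match_patch installed:

```python
# Style codes follow the question's convention:
# '' = unchanged text, 's' = deleted (strike-through), 'b' = inserted (bold).
STYLES = {0: '', -1: 's', 1: 'b'}

def diffs_to_paratext(diffs):
    """Convert diff_match_patch (op, text) tuples into (text, style) tuples."""
    return [(text, STYLES[op]) for op, text in diffs]

# In the real script the list would come from something like:
#     dmp = diff_match_patch.diff_match_patch()
#     diffs = dmp.diff_main(str1, str2)
#     dmp.diff_cleanupSemantic(diffs)   # cleans the list in place
# Here the post-cleanup output from the example is used directly.
cleaned = [(-1, 'This is my original paragraph'),
           (1, 'My paragraph is much better now'),
           (0, '.')]

print(diffs_to_paratext(cleaned))
```

In revFinder itself this would amount to adding the diff_cleanupSemantic call right after diff_main and leaving the rest of the function unchanged.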

jdhildeb answered Oct 26 '22 10:10