Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find the most likely word alignment between two strings in Python

I have 2 similar strings. How can I find the most likely word alignment between these two strings in Python?

Example of input:

string1 = 'my channel is youtube dot com slash example and then I also do live streaming on twitch.'
string2 = 'my channel is youtube.com/example and then I also do livestreaming on twitch.'

Desired output:

alignment['my']        = 'my'
alignment['channel']   = 'channel'
alignment['is']        = 'is'
alignment['youtube']   = 'youtube.com/example'
alignment['dot']       = 'youtube.com/example'
alignment['com']       = 'youtube.com/example'
alignment['slash']     = 'youtube.com/example'
alignment['example']   = 'youtube.com/example'
alignment['and']       = 'and'
alignment['then']      = 'then'
alignment['I']         = 'I'
alignment['also']      = 'also'
alignment['do']        = 'do'
alignment['live']      = 'livestreaming'
alignment['streaming'] = 'livestreaming'
alignment['on']        = 'on'
alignment['twitch']    = 'twitch'
like image 826
Franck Dernoncourt Avatar asked Jun 12 '26 13:06

Franck Dernoncourt


1 Answers

Alignment is tricky. spaCy can do it (see Aligning tokenization) but AFAIK it assumes that the two underlying strings are identical which is not the case here.

I used Bio.pairwise2 a few years back for a similar problem. I don't quite remember the exact settings, but here's what the default setup would give you:

from Bio import pairwise2
from Bio.pairwise2 import format_alignment


string1 = 'my channel is youtube dot com slash example and then I also do live streaming on twitch.'
string2 = 'my channel is youtube.com/example and then I also do livestreaming on twitch.'

alignments = pairwise2.align.globalxx(string1.split(), 
                                      string2.split(),
                                      gap_char=['-']
                                     )

The resulting alignments - pretty close already:

>>> format_alignment(*alignments[0])
my channel is youtube dot com slash example          -          and then I also do live streaming       -       on twitch. 
 |    |     |                                                    |    |  |   |   |                               |    |    
my channel is    -     -   -    -      -    youtube.com/example and then I also do  -       -     livestreaming on twitch. 
  Score=10

You can provide your own matching functions, which would make fuzzywuzzy an interesting addition.

like image 197
fsimonjetz Avatar answered Jun 15 '26 03:06

fsimonjetz



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!