I have 2 similar strings. How can I find the most likely word alignment between these two strings in Python?
Example of input:
string1 = 'my channel is youtube dot com slash example and then I also do live streaming on twitch.'
string2 = 'my channel is youtube.com/example and then I also do livestreaming on twitch.'
Desired output:
alignment['my'] = 'my'
alignment['channel'] = 'channel'
alignment['is'] = 'is'
alignment['youtube'] = 'youtube.com/example'
alignment['dot'] = 'youtube.com/example'
alignment['com'] = 'youtube.com/example'
alignment['slash'] = 'youtube.com/example'
alignment['example'] = 'youtube.com/example'
alignment['and'] = 'and'
alignment['then'] = 'then'
alignment['I'] = 'I'
alignment['also'] = 'also'
alignment['do'] = 'do'
alignment['live'] = 'livestreaming'
alignment['streaming'] = 'livestreaming'
alignment['on'] = 'on'
alignment['twitch'] = 'twitch'
Alignment is tricky. spaCy can do it (see Aligning tokenization) but AFAIK it assumes that the two underlying strings are identical which is not the case here.
I used Bio.pairwise2 a few years back for a similar problem. I don't quite remember the exact settings, but here's what the default setup would give you:
from Bio import pairwise2
from Bio.pairwise2 import format_alignment
string1 = 'my channel is youtube dot com slash example and then I also do live streaming on twitch.'
string2 = 'my channel is youtube.com/example and then I also do livestreaming on twitch.'
alignments = pairwise2.align.globalxx(string1.split(),
string2.split(),
gap_char=['-']
)
The resulting alignments - pretty close already:
>>> format_alignment(*alignments[0])
my channel is youtube dot com slash example - and then I also do live streaming - on twitch.
| | | | | | | | | |
my channel is - - - - - youtube.com/example and then I also do - - livestreaming on twitch.
Score=10
You can provide your own matching functions, which would make fuzzywuzzy an interesting addition.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With