
Python - difference between two strings

I'd like to store a lot of words in a list. Many of these words are very similar. For example, I have the word afrykanerskojęzyczny and many words like afrykanerskojęzycznym, afrykanerskojęzyczni, nieafrykanerskojęzyczni. What is an efficient way (fast, and producing a small diff) to find the difference between two strings, and then restore the second string from the first string plus the diff?

Asked by user2626682 on Jul 28 '13 at 01:07




2 Answers

You can use ndiff in the difflib module to do this. It has all the information necessary to convert one string into another string.

A simple example:

    import difflib

    cases = [('afrykanerskojęzyczny', 'afrykanerskojęzycznym'),
             ('afrykanerskojęzyczni', 'nieafrykanerskojęzyczni'),
             ('afrykanerskojęzycznym', 'afrykanerskojęzyczny'),
             ('nieafrykanerskojęzyczni', 'afrykanerskojęzyczni'),
             ('nieafrynerskojęzyczni', 'afrykanerskojzyczni'),
             ('abcdefg', 'xac')]

    for a, b in cases:
        print('{} => {}'.format(a, b))
        for i, s in enumerate(difflib.ndiff(a, b)):
            if s[0] == ' ':
                continue
            elif s[0] == '-':
                print('Delete "{}" from position {}'.format(s[-1], i))
            elif s[0] == '+':
                print('Add "{}" to position {}'.format(s[-1], i))
        print()

prints:

    afrykanerskojęzyczny => afrykanerskojęzycznym
    Add "m" to position 20

    afrykanerskojęzyczni => nieafrykanerskojęzyczni
    Add "n" to position 0
    Add "i" to position 1
    Add "e" to position 2

    afrykanerskojęzycznym => afrykanerskojęzyczny
    Delete "m" from position 20

    nieafrykanerskojęzyczni => afrykanerskojęzyczni
    Delete "n" from position 0
    Delete "i" from position 1
    Delete "e" from position 2

    nieafrynerskojęzyczni => afrykanerskojzyczni
    Delete "n" from position 0
    Delete "i" from position 1
    Delete "e" from position 2
    Add "k" to position 7
    Add "a" to position 8
    Delete "ę" from position 16

    abcdefg => xac
    Add "x" to position 0
    Delete "b" from position 2
    Delete "d" from position 4
    Delete "e" from position 5
    Delete "f" from position 6
    Delete "g" from position 7
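
Since the question also asks how to get the second string back, here is a minimal sketch using `difflib.restore`, which rebuilds either input from the full `ndiff` output. The helper name `restore_both` is only for illustration. Note that the positions printed above are indices into the `ndiff` output, not offsets into either string.

```python
import difflib


def restore_both(a, b):
    """Rebuild both strings from a character-level ndiff delta."""
    # ndiff returns a generator, so materialise it to reuse it twice.
    delta = list(difflib.ndiff(a, b))
    first = ''.join(difflib.restore(delta, 1))   # 1 selects the first input
    second = ''.join(difflib.restore(delta, 2))  # 2 selects the second input
    return first, second


a = 'afrykanerskojęzyczny'
b = 'nieafrykanerskojęzyczni'
first, second = restore_both(a, b)
print(first == a, second == b)  # True True
```

This only works with the complete delta, unchanged characters included, so it is probably not the most compact thing to store.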
Answered by dawg on Oct 06 '22 at 13:10


I like the ndiff answer, but if you want to collect only the changes in a list, you could do something like:

    import difflib

    case_a = 'afrykbnerskojęzyczny'
    case_b = 'afrykanerskojęzycznym'

    output_list = [li for li in difflib.ndiff(case_a, case_b) if li[0] != ' ']
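
A list filtered like this no longer says where each change belongs, so on its own it is not enough to rebuild the second string. If the goal is a small diff that can still be applied, one option (a sketch, not the only way) is to keep just the non-equal opcodes from `difflib.SequenceMatcher.get_opcodes()` together with the replacement text. `make_delta` and `apply_delta` are hypothetical helper names.

```python
import difflib


def make_delta(a, b):
    """Return (start, end, text) triples: replace a[start:end] with text."""
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [(i1, i2, b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != 'equal']


def apply_delta(a, delta):
    """Rebuild the second string from a and the output of make_delta."""
    pieces = []
    pos = 0
    for i1, i2, text in delta:
        pieces.append(a[pos:i1])  # unchanged run copied from a
        pieces.append(text)       # inserted or replacement text ('' for a delete)
        pos = i2                  # skip the deleted or replaced run in a
    pieces.append(a[pos:])        # unchanged tail
    return ''.join(pieces)


a = 'afrykanerskojęzyczny'
b = 'afrykanerskojęzycznym'
delta = make_delta(a, b)
print(delta)                       # [(20, 20, 'm')]
print(apply_delta(a, delta) == b)  # True
```

The offsets here refer to the first string, so the delta stays small when the words share long runs, and nothing about the unchanged characters has to be stored.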
Answered by Eric on Oct 06 '22 at 11:10