How to determine character similarity?

Tags:

I am using the Levenshtein distance to find similar strings after OCR. However, for some strings the edit distance is the same, although the visual appearance is obviously different.

For example the string Co will return these matches:

Click to copy

CY (1)
CZ (1)
Ca (1)

Considering, that Co is the result from an OCR engine, Ca would be the more likely match than the ones. Therefore, after calculating the Levenshtein distance, I'd like to refine query result by ordering by visual similarity. In order to calculate this similarity a I'd like to use standard sans-serif font, like Arial.

Is there a library I can use for this purpose, or how could I implement this myself? Alternatively, are there any string similarity algorithms that are more accurate than the Levenshtein distance, which I could use in addition?

452

asked May 03 '12 14:05

Pedro

2 Answers

If you're looking for a table that will allow you to calculate a 'replacement cost' of sorts based on visual similarity, I've been searching for such a thing for awhile with little success, so I started looking at it as a new problem. I'm not working with OCR, but I am looking for a way to limit the search parameters in a probabilistic search for mis-typed characters. Since they are mis-typed because a human has confused the characters visually, the same principle should apply to you.

My approach was to categorize letters based on their stroke components in an 8-bit field. the bits are, left to right:

Click to copy

7: Left Vertical
6: Center Vertical
5: Right Vertical
4: Top Horizontal
3: Middle Horizontal
2: Bottom Horizontal
1: Top-left to bottom-right stroke
0: Bottom-left to top-right stroke

For lower-case characters, descenders on the left are recorded in bit 1, and descenders on the right in bit 0, as diagonals.

With that scheme, I came up with the following values which attempt to rank the characters according to visual similarity.

Click to copy

m:               11110000: F0
g:               10111101: BD
S,B,G,a,e,s:     10111100: BC
R,p:             10111010: BA
q:               10111001: B9
P:               10111000: B8
Q:               10110110: B6
D,O,o:           10110100: B4
n:               10110000: B0
b,h,d:           10101100: AC
H:               10101000: A8
U,u:             10100100: A4
M,W,w:           10100011: A3
N:               10100010: A2
E:               10011100: 9C
F,f:             10011000: 98
C,c:             10010100: 94
r:               10010000: 90
L:               10000100: 84
K,k:             10000011: 83
T:               01010000: 50
t:               01001000: 48
J,j:             01000100: 44
Y:               01000011: 43
I,l,i:           01000000: 40
Z,z:             00010101: 15
A:               00001011: 0B
y:               00000101: 05
V,v,X,x:         00000011: 03

This, as it stands, is too primitive for my purposes and requires more work. You may be able to use it, however, or perhaps adapt it to suit your purposes. The scheme is fairly simple. This ranking is for a mono-space font. If you are using a sans-serif font, then you likely have to re-work the values.

This table is a hybrid table including all characters, lower- and upper-case, but if you split it into upper-case only and lower-case only it might prove more effective, and that would also allow to apply specific casing penalties.

Keep in mind that this is early experimentation. If you see a way to improve it (for example by changing the bit-sequencing) by all means feel free to do so.

130

answered Sep 20 '22 04:09

David

In general I've seen Damerau-Levenshtein used much more often than just Levenshtein , and it basically adds the transposition operation. It is supposed to account for more than 80% of human misspelling, so you should certainly consider that.

As to your specific problem, you could easily modify the algorithm to increase the cost when substituting a capital letter with a non capital letter, and the opposite to obtain something like that:

Click to copy

dist(Co, CY) = 2
dist(Co, CZ) = 2
dist(Co, Ca) = 1

answered Sep 21 '22 04:09

Charles Menguy

Related questions
                            
                                Linkedlist keep track of min in constant time?
                            
                                Need an algorithm to make all rows and columns have non-negative sums
                            
                                Syntax Highlighting / Lexical analysis Algorithms
                            
                                Best fit plane by minimizing orthogonal distances
                            
                                Improving Minimax Algorithm
                            
                                How do you generate a random number in between 1 and 7 by using a function which generates 1 to 5 [duplicate]
                            
                                Determine if a set of data is from a linear or logarithmic function?
                            
                                Separation and pattern matching techniques
                            
                                What is the worst case for KMP string search algorithm? [closed]
                            
                                Matlab: Algorithm for voronoi diagram of ellipses
                            
                                Class formulation concept from C++ and Java to Ada
                            
                                Number distribution
                            
                                How to design a latest recently used cache?
                            
                                SAT/CNF optimization
                            
                                Implementing a hash table
                            
                                How to generate an AVL tree as lopsided as possible?
                            
                                Better Solution to Project Euler #36?
                            
                                Is there an algorithm to find nearby points using only polar coordinates?
                            
                                Best way to test a clustering algorithm
                            
                                Programming Puzzle: How to paint a board?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How to determine character similarity?

Tags:

algorithm

pattern-matching

levenshtein-distance

similarity

Pedro

People also ask

2 Answers

David

Charles Menguy

Recent Activity

Donate For Us