Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Sorting String based on similarities [closed]

consider the following Strings:

  • he llo
  • goodbye
  • hello
  • = (goodbye)
  • (he)(llo)
  • good bye
  • helium

I'm trying to sort these in such a way that similar words comes together, I know

  1. alphanumerical sorting is not an option
  2. removing special chars ",-_ and etc then comparing is certainly helpful but results won't be as good as I hope for.

NOTE :

there might be few different desired ouput for this, one of which is :

DESIRED OUTPUT:

  1. hello
  2. he llo
  3. (he)(llo)
  4. helium
  5. goodbye
  6. good bye
  7. = (goodbye)

so my question is that if there is a java package that compares strings and ultimately sort them based on it .

I've heard of terms such as n-gram and skip-gram but didn't quite understand them. I'm not even sure if they can be useful for me at all.

UPDATE: finding similarities is certainly part of my question but the main problem is the sorting part.

like image 885
nafas Avatar asked Jul 13 '15 09:07

nafas


1 Answers

Here's one possible approach.

Calculate the edit distance/Levenshtein distance between each pair of strings and then you use view the strings as a complete graph where the edge weights come from the edit distance. Choose a threshold for those weights and remove all the weights that to high. Then find the cliques in this graph. If your threshold is fairly low perhaps even finding connected components would be an option.

Note: Perhaps it would be better to substitute some edit distance with one of the similarity measures in the link that @dognose posted. Also, note that finding cliques will be very slow if you have a large numbers of strings

like image 132
Simon Avatar answered Oct 15 '22 10:10

Simon