Consider the following strings:
I'm trying to sort these in such a way that similar words come together. I know alphanumerical sorting is not an option. Stripping characters such as ",-_ etc. and then comparing is certainly helpful, but the results won't be as good as I hope for.
NOTE: there might be a few different desired outputs for this, one of which is:
DESIRED OUTPUT:
So my question is whether there is a Java package that compares strings and ultimately sorts them based on that comparison. I've heard of terms such as n-gram and skip-gram but didn't quite understand them. I'm not even sure whether they can be useful for me at all.
UPDATE: Finding similarities is certainly part of my question, but the main problem is the sorting part.
Here's one possible approach.
Calculate the edit distance (Levenshtein distance) between each pair of strings, then view the strings as a complete graph whose edge weights are those distances. Choose a threshold and remove every edge whose weight is too high. Then find the cliques in this graph. If your threshold is fairly low, perhaps even finding connected components would be an option.
Note: Perhaps it would be better to substitute the edit distance with one of the similarity measures in the link that @dognose posted. Also, note that finding cliques will be very slow if you have a large number of strings.
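To make the idea concrete, here is a minimal sketch of the cheaper connected-components variant in plain Java, not a definitive implementation. The class name SimilarityGrouping, the method names, the sample words and the threshold of 1 are all made up for illustration; they are not taken from the question. Two strings are linked when their Levenshtein distance is at most the threshold, linked strings are merged with a small union-find, and the groups are then written out one after another so that similar strings end up next to each other.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, for illustration only.
class SimilarityGrouping {

    // Levenshtein distance via dynamic programming, keeping two rows of the table.
    // Insertions, deletions and substitutions each cost 1.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Union-find root lookup with path halving.
    static int find(int[] parent, int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];
            x = parent[x];
        }
        return x;
    }

    // Connected components of the graph that keeps only edges with distance <= maxDistance.
    // maxDistance is an assumed tuning parameter; a good value depends on the data.
    static List<List<String>> group(List<String> words, int maxDistance) {
        int n = words.size();
        int[] parent = new int[n];
        for (int i = 0; i < n; i++) parent[i] = i;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                if (levenshtein(words.get(i), words.get(j)) <= maxDistance) {
                    parent[find(parent, i)] = find(parent, j);
                }
            }
        }
        // Groups appear in the order in which their first member occurs in the input.
        Map<Integer, List<String>> byRoot = new LinkedHashMap<>();
        for (int i = 0; i < n; i++) {
            byRoot.computeIfAbsent(find(parent, i), k -> new ArrayList<>()).add(words.get(i));
        }
        return new ArrayList<>(byRoot.values());
    }

    // Flatten the groups so that similar strings are adjacent in the result.
    static List<String> sortBySimilarity(List<String> words, int maxDistance) {
        List<String> result = new ArrayList<>();
        for (List<String> g : group(words, maxDistance)) result.addAll(g);
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample input, not the strings from the question.
        List<String> words = Arrays.asList("bolt", "nut", "bolts", "nuts", "washer");
        System.out.println(sortBySimilarity(words, 1)); // prints [bolt, bolts, nut, nuts, washer]
    }
}
```

The pairwise comparison is quadratic in the number of strings, which is usually fine for a few thousand entries. Swapping levenshtein for another similarity measure only changes the condition inside the double loop.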