Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

clustering words based on their char set

Say there is a word set and I would like to clustering them based on their char bag (multiset). For example

{tea, eat, abba, aabb, hello}

will be clustered into

{{tea, eat}, {abba, aabb}, {hello}}.

abba and aabb are clustered together because they have the same char bag, i.e. two a and two b.

To make it efficient, a naive way I can think of is to covert each word into a char-cnt series, for exmaple, abba and aabb will be both converted to a2b2, tea/eat will be converted to a1e1t1. So that I can build a dictionary and group words with same key.

Two issues here: first I have to sort the chars to build the key; second, the string key looks awkward and performance is not as good as char/int keys.

Is there a more efficient way to solve the problem?

like image 257
user2671488 Avatar asked Oct 04 '22 06:10

user2671488


1 Answers

For detecting anagrams you can use a hashing scheme based on the product of prime numbers A->2, B->3, C->5 etc. will give "abba" == "aabb" == 36 (but a different letter to primenumber mapping will be better) See my answer here.

like image 114
wildplasser Avatar answered Oct 13 '22 01:10

wildplasser