Say there is a word set and I would like to clustering them based on their char bag (multiset). For example
{tea, eat, abba, aabb, hello}
will be clustered into
{{tea, eat}, {abba, aabb}, {hello}}.
abba
and aabb
are clustered together because they have the same char bag, i.e. two a
and two b
.
To make it efficient, a naive way I can think of is to covert each word into a char-cnt series, for exmaple, abba
and aabb
will be both converted to a2b2
, tea/eat will be converted to a1e1t1
. So that I can build a dictionary and group words with same key.
Two issues here: first I have to sort the chars to build the key; second, the string key looks awkward and performance is not as good as char/int keys.
Is there a more efficient way to solve the problem?
For detecting anagrams you can use a hashing scheme based on the product of prime numbers A->2, B->3, C->5
etc. will give "abba" == "aabb" == 36 (but a different letter to primenumber mapping will be better)
See my answer here.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With