Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Synonym finder algorithm

I think example will be much better than loooong description :)

Let's assume we have an array of arrays:

("Server1", "Server_1", "Main Server", "192.168.0.3")
("Server_1", "VIP Server", "Main Server")
("Server_2", "192.168.0.4")
("192.168.0.3", "192.168.0.5")
("Server_2", "Backup")

Each line contains strings which are synonyms. And as a result of processing of this array I want to get this:

("Server1", "Server_1", "Main Server", "192.168.0.3", "VIP Server", "192.168.0.5")
("Server_2", "192.168.0.4", "Backup")

So I think I need a kind of recursive algorithm. Programming language actually doesn't matter — I need only a little help with idea in general. I'm going to use php or python.

Thank you!

like image 598
DEgorov Avatar asked May 26 '11 06:05

DEgorov


1 Answers

This problem can be reduced to a problem in graph theory where you find all groups of connected nodes in a graph.

An efficient way to solve this problem is doing a "flood fill" algorithm, which is essentially a recursive breath first search. This wikipedia entry describes the flood fill algorithm and how it applies to solving the problem of finding connected regions of a graph.

To see how the original question can be made into a question on graphs: make each entry (e.g. "Server1", "Server_1", etc.) a node on a graph. Connect nodes with edges if and only if they are synonyms. A matrix data structure is particularly appropriate for keeping track of the edges, provided you have enough memory. Otherwise a sparse data structure like a map will work, especially since the number of synonyms will likely be limited.

  • Server1 is Node #0
  • Server_1 is Node #1
  • Server_2 is Node #2

Then edge[0][1] = edge[1][0] = 1, indicated that there is an edge between nodes #0 and #1 ( which means that they are synonyms ). While edge[0][2] = edge[2][0] = 0, indicating that Server1 and Server_2 are not synonyms.

Complexity Analysis

Creating this data structure is pretty efficient because a single linear pass with a lookup of the mapping of strings to node numbers is enough to crate it. If you store the mapping of strings to node numbers in a dictionary then this would be a O(n log n) step.

Doing the flood fill is O(n), you only visit each node in the graph once. So, the algorithm in all is O(n log n).

like image 149
Himadri Choudhury Avatar answered Sep 18 '22 12:09

Himadri Choudhury