I am trying to implement the Brown clustering algorithm in Python.
My data structure is clusters = List[List].
At any given time the outer list's length is at most 40 or 41.
Each inner list contains English words such as 'the', 'hello', etc.
So I have 8,000 words in total (my vocabulary), and initially the first 40 words are put into clusters.
I iterate over my vocabulary from 41 to 8000: I append the next word as its own cluster, do some computation (this takes very little time), then merge two items in the list and delete one of them. For example, if c1 and c2 are indices into clusters:
for i in range(41, 8000):
    clusters.append(vocabulary[i])
    c1 = computation_1()  # cheap
    c2 = computation_2()  # cheap
    clusters[c1] = clusters[c1] + clusters[c2]
    del clusters[c2]
But the time taken by the line clusters[c1] = clusters[c1] + clusters[c2] grows gradually as I iterate over my vocabulary: initially, for items 41-50, it is about 1 second, but for every further 20 items it grows by another second.
When I comment out just clusters[c1] = clusters[c1] + clusters[c2], I observe that all iterations take constant time. I am not sure how I can speed up this process.
for i in range(41, 8000):
    clusters.append(vocabulary[i])
    c1 = computation_1()
    c2 = computation_2()
    # clusters[c1] = clusters[c1] + clusters[c2]
    del clusters[c2]
I am new to Stack Overflow, so please excuse any incorrect formatting here.
Thanks
The problem you're running into is that list concatenation is a linear-time operation, which makes your entire loop O(n^2) (and that's prohibitively slow for n much larger than 1000). And that's ignoring how copying such large lists can be bad for cache performance, etc.
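To make the linear cost visible, here's a toy timing (the sizes are arbitrary, chosen just for illustration):

import time

# Toy illustration (arbitrary sizes): a single l1 + l2 copies both operands
# into a brand-new list, so its cost grows linearly with the combined length.
# Doing this in a loop whose lists keep growing gives quadratic total work.
for n in (10_000, 100_000, 1_000_000):
    l1 = list(range(n))
    l2 = list(range(n))
    start = time.perf_counter()
    l3 = l1 + l2
    print("N = {:>9,}: {:.4f}s".format(n, time.perf_counter() - start))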
The solution I recommend is to use a disjoint-set data structure. This is a tree-based data structure that "self-flattens" as you perform queries, resulting in very fast runtimes for "merging" clusters.
The basic idea is that each word starts off as its own "singleton" tree, and merging clusters consists of making the root of one tree the child of another. This repeats (with some care for balancing) until you have as many clusters as desired.
I've written an example implementation (GitHub link) that assumes the elements of each set are numbers. As long as you have a mapping from vocabulary terms to integers, it should work just fine for your purposes. (Note: I've done some preliminary testing, but I wrote it in five minutes just now, so I'd recommend checking my work. ;) )
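For concreteness, here is a minimal sketch of such a structure (union by rank plus path compression); the DisjointSet/join/query names match the usage below, but treat it as an illustration rather than the linked code:

# A minimal disjoint-set sketch with path compression and union by rank.
# This is an illustrative stand-in, not the linked implementation.
class DisjointSet:
    def __init__(self, n):
        self.parent = list(range(n))  # each element starts as its own root
        self.rank = [0] * n           # upper bound on each tree's height

    def find(self, x):
        # Walk up to the root, flattening the path as we go.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def join(self, x, y):
        # Merge the clusters containing x and y (union by rank).
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1

    def query(self, x, y):
        # True if x and y are currently in the same cluster.
        return self.find(x) == self.find(y)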
To use it in your code, I would do something like the following:
clusters = DisjointSet(8000)
# some code to merge the first 40 words into clusters
for i in range(41, 8000):
    c1 = some_computation()  # assuming c1 is a number
    c2 = some_computation()  # assuming c2 is a number
    clusters.join(c1, c2)

# Now, if you want to determine whether some word with number k is
# in the same cluster as the word with number j:
print("{} and {} are in the same cluster? {}".format(j, k, clusters.query(j, k)))
While sets provide faster access times than lists, they actually have worse runtimes when copying. This makes sense in theory, because a set object has to allocate and maintain more memory than a list to keep an appropriate load factor. Also, inserting so many items can trigger a "rehash" of the entire hash table, which is a quadratic-time operation in the worst case.
However, practice is what we're concerned with here, so I ran a quick experiment to determine exactly how much worse off sets are than lists.

The code for this test is below, in case anyone is interested. I'm using the Intel Distribution for Python, so my timings may be slightly faster than on your machine.
import time
import random
import numpy as np
import matplotlib.pyplot as plt

data = []
for trial in range(5):
    trial_data = []
    for N in range(0, 20000, 50):
        l1 = random.sample(range(1000000), N)
        l2 = random.sample(range(1000000), N)
        s1 = set(l1)
        s2 = set(l2)
        # Time to concatenate two lists of length N
        # (time.perf_counter replaces the now-removed time.clock)
        start_lst = time.perf_counter()
        l3 = l1 + l2
        stop_lst = time.perf_counter()
        # Time to union two sets of length N
        start_set = time.perf_counter()
        s3 = s1 | s2
        stop_set = time.perf_counter()
        trial_data.append([N, stop_lst - start_lst, stop_set - start_set])
    data.append(trial_data)

# Average the trials and plot
data_array = np.array(data)
avg_data = np.average(data_array, 0)
fig = plt.figure()
ax = plt.gca()
ax.plot(avg_data[:, 0], avg_data[:, 1], label='Lists')
ax.plot(avg_data[:, 0], avg_data[:, 2], label='Sets')
ax.set_xlabel('Length of set or list (N)')
ax.set_ylabel('Seconds to union or concat (s)')
plt.legend(loc=2)
plt.show()