
Updating nested list takes too long in python

I am trying to implement the Brown clustering algorithm in Python.

I have a data structure of clusters = List[List].

At any given time the outer list's length will be at most 40 or 41.

The inner lists contain English words such as 'the', 'hello', etc.

So I have 8000 words in total (my vocabulary), and initially the first 40 words are put into clusters.

I iterate over my vocabulary from index 41 to 8000, do some computation (this takes very little time), then merge two items of the list and delete one item. For example, if c1 and c2 are indices into clusters:

for i in range(41, 8000):
    clusters.append(vocabulary[i])
    c1 = computation 1  # pseudocode: pick first cluster index
    c2 = computation 2  # pseudocode: pick second cluster index
    clusters[c1] = clusters[c1] + clusters[c2]  # merge the two clusters
    del clusters[c2]

But the time taken by the line clusters[c1] = clusters[c1] + clusters[c2] grows gradually as I iterate over my vocabulary. Initially, for items 41-50, it is about 1 second, but for every further 20 items in the vocabulary the time grows by another second.

When I comment out just clusters[c1] = clusters[c1] + clusters[c2] from my code, I observe that all iterations take constant time. I am not sure how I can speed up this process.

for i in range(41, 8000):
    clusters.append(vocabulary[i])
    c1 = computation 1
    c2 = computation 2
    #clusters[c1] = clusters[c1] + clusters[c2]
    del clusters[c2]

I am new to Stack Overflow; please excuse any incorrect formatting here.

Thanks

asked Apr 16 '26 by Vignesh

1 Answer

The problem you're running into is that list concatenation is a linear-time operation. Thus, your entire loop is O(n^2) (and that's prohibitively slow for n much larger than 1000). That's before even considering how copying such large lists hurts cache performance.
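A quick way to see this cost directly (a hypothetical micro-benchmark, not from the original post): concatenation time is proportional to the combined length of the two lists, so concatenating onto a large list is far slower than onto a small one.

```python
# Micro-benchmark: list concatenation copies every element, so its cost
# scales with the total length of the result.
import timeit

big = list(range(200_000))
small = list(range(100))

# Concatenating two tiny lists, 100 times.
t_small = timeit.timeit(lambda: small + small, number=100)
# Concatenating a tiny list onto a large one, 100 times.
t_big = timeit.timeit(lambda: big + small, number=100)

print(t_big > t_small)  # the large concatenation dominates
```

In the question's loop the merged cluster keeps growing, so each `clusters[c1] + clusters[c2]` behaves like the `big + small` case, and the per-iteration time keeps rising.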

Disjoint Set data structure

The solution I recommend is to use a disjoint-set data structure. This is a tree-based data structure that "self-flattens" as you perform queries, resulting in very fast runtimes for "merging" clusters.

The basic idea is that each word starts off as its own "singleton" tree, and merging clusters consists of making the root of one tree the child of another. This repeats (with some care for balancing) until you have as many clusters as desired.
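A minimal sketch of that idea is below. This is an illustrative implementation, not the one linked later in the answer; the method names (`join`, `query`) simply mirror how the answer uses them, and "balancing" here is done by union by size plus path compression.

```python
# Minimal disjoint-set (union-find) sketch with path compression and
# union by size. Elements are integers 0..n-1.
class DisjointSet:
    def __init__(self, n):
        self.parent = list(range(n))  # each element starts as its own root
        self.size = [1] * n

    def find(self, x):
        # Walk up to the root, then flatten the path so future
        # queries on these elements are near-constant time.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def join(self, a, b):
        # Merge the clusters containing a and b.
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra  # always attach the smaller tree
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]

    def query(self, a, b):
        # True if a and b are currently in the same cluster.
        return self.find(a) == self.find(b)
```

With union by size and path compression, a sequence of merges and queries runs in near-linear total time, instead of the quadratic behavior of repeated list concatenation.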

I've written an example implementation (GitHub link) that assumes elements of each set are numbers. As long as you have a mapping from vocabulary terms to integers, it should work just fine for your purposes. (Note: I've done some preliminary testing, but I wrote it in 5 minutes right now so I'd recommend checking my work. ;) )

To use in your code, I would do something like the following:

clusters = DisjointSet(8000)
# some code to merge the first 40 words into clusters
for i in range(41, 8000):
    c1 = some_computation() # assuming c1 is a number
    c2 = some_computation() # assuming c2 is a number
    clusters.join(c1, c2)

# Now, if you want to determine if some word with number k is 
# in the same cluster as a word with number j:
print("{} and {} are in the same cluster? {}".format(j, k, clusters.query(j, k)))
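Since the question's clusters hold words rather than numbers, the mapping from vocabulary terms to integers mentioned above could be as simple as the following (the variable names here are hypothetical, chosen to match the question's description):

```python
# Hypothetical mapping from vocabulary words to integer ids, so clusters
# can be tracked by index in the DisjointSet.
vocabulary = ['the', 'hello', 'world']  # example words from the question
word_to_id = {word: i for i, word in enumerate(vocabulary)}

print(word_to_id['hello'])  # look up a word's integer id
```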

Regarding Sets vs Lists

While sets provide faster membership tests than lists, they actually have worse runtimes when copied. This makes sense in theory, because a set object has to allocate and assign more memory than a list in order to maintain an appropriate load factor. Also, inserting that many items can trigger a "rehash" of the entire hash table, which is a quadratic-time operation in the worst case.

However, practice is what we're concerned with here, so I ran a quick experiment to determine exactly how much worse off sets are than lists.

[Plot: time to union two sets vs. concatenate two lists, as a function of N]

Code for performing this test, in case anyone is interested, is below. I'm using the Intel distribution of Python, so my timings may be slightly faster than on your machine.

import time
import random
import numpy as np
import matplotlib.pyplot as plt

data = []
for trial in range(5):
    trial_data = []

    for N in range(0, 20000, 50):
        l1 = random.sample(range(1000000), N)
        l2 = random.sample(range(1000000), N)
        s1 = set(l1)
        s2 = set(l2)

        # Time to concatenate two lists of length N
        start_lst = time.perf_counter()
        l3 = l1+l2
        stop_lst = time.perf_counter()

        # Time to union two sets of length N
        start_set = time.perf_counter()
        s3 = s1|s2
        stop_set  = time.perf_counter()

        trial_data.append([N, stop_lst - start_lst, stop_set - start_set])
    data.append(trial_data)

# average the trials and plot
data_array = np.array(data)
avg_data = np.average(data_array, 0)

fig = plt.figure()
ax = plt.gca()
ax.plot(avg_data[:,0], avg_data[:,1], label='Lists')
ax.plot(avg_data[:,0], avg_data[:,2], label='Sets')
ax.set_xlabel('Length of set or list (N)')
ax.set_ylabel('Seconds to union or concat (s)')
plt.legend(loc=2)
plt.show()
answered Apr 18 '26 by apnorton