Is there a way to reduce memory usage of mini-batch kmeans?

I'm working with a dataset of 6.4 million samples with 500 dimensions, and I'm trying to group it into 200 clusters. I'm limited to 90 GB of RAM, and when I try to run MiniBatchKMeans from sklearn.cluster, the OS kills the process for using too much memory.

This is the code:

import numpy as np
from sklearn import cluster

# Load the full dataset into memory (this alone is ~25 GB at float64)
data = np.loadtxt('temp/data.csv', delimiter=',')
labels = np.genfromtxt('temp/labels', delimiter=',')

numClusters = 200
kmeans = cluster.MiniBatchKMeans(n_clusters=numClusters, random_state=0).fit(data)
predict = kmeans.predict(data)
Tdata = kmeans.transform(data)

It doesn't get past clustering.
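
A rough back-of-the-envelope estimate (a sketch, assuming float64, which is np.loadtxt's default dtype) shows why the all-in-memory approach hits the limit:

n_samples, n_features = 6_400_000, 500
gib = n_samples * n_features * 8 / 1024**3  # 8 bytes per float64
print(f"{gib:.1f} GiB")  # ~23.8 GiB for the raw matrix alone --
                         # np.loadtxt's parsing overhead and sklearn's
                         # validation copies multiply this further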

asked Oct 29 '22 by user1816679

1 Answer

The solution is to use scikit-learn's partial_fit method. Not all estimators support it, but MiniBatchKMeans does.

That means you can train "partially", but you have to split your data rather than reading it all in one go. This can be done with generators, and there are many ways to do it; if you use pandas, for example, you can read the CSV in chunks with pd.read_csv's chunksize argument, as in the sketch below.
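
For instance, a minimal sketch of such a generator (the path comes from the question; the chunk size and the float32 downcast are assumptions, not requirements):

import pandas as pd

def data_chunks(path, chunksize=100_000):
    """Yield the CSV one chunk at a time instead of loading all rows at once."""
    for chunk in pd.read_csv(path, header=None, chunksize=chunksize):
        # float32 halves the per-chunk footprint compared to float64
        yield chunk.to_numpy(dtype='float32')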

Then, instead of calling fit once on the whole array, call partial_fit on each chunk to train incrementally.
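
Putting the two together, a sketch of the training loop might look like this (data_chunks is the hypothetical generator above; n_clusters=200 comes from the question):

import numpy as np
from sklearn.cluster import MiniBatchKMeans

kmeans = MiniBatchKMeans(n_clusters=200, random_state=0)

# Each partial_fit call updates the centroids in place;
# repeat the loop for additional passes over the data if needed.
for chunk in data_chunks('temp/data.csv'):
    kmeans.partial_fit(chunk)

# Predict in chunks too, so the full matrix is never held in memory at once.
predict = np.concatenate(
    [kmeans.predict(chunk) for chunk in data_chunks('temp/data.csv')]
)

The same chunked pattern works for transform if you need the cluster-distance representation.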

answered Nov 15 '22 by Or Duan