I'm working with a dataset of 6.4 million samples with 500 dimensions, and I'm trying to group it into 200 clusters. I'm limited to 90 GB of RAM, and when I try to run MiniBatchKMeans from sklearn.cluster, the OS kills the process for using too much memory.
This is the code:
import numpy as np
from sklearn import cluster

numClusters = 200

# The whole 6.4M x 500 array is loaded into memory at once
data = np.loadtxt('temp/data.csv', delimiter=',')
labels = np.genfromtxt('temp/labels', delimiter=',')

kmeans = cluster.MiniBatchKMeans(n_clusters=numClusters, random_state=0).fit(data)
predict = kmeans.predict(data)
Tdata = kmeans.transform(data)
It doesn't get past clustering.
The solution is to use sklearn's partial_fit method. Not all estimators have this option, but MiniBatchKMeans does.
So you can train the model incrementally, but you'll have to split your data instead of reading it all in one go. This can be done with generators, and there are many ways to do it. If you use pandas, for example, you can read the CSV in chunks with read_csv and its chunksize parameter, as in the sketch below.
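A minimal sketch of that chunked reading, assuming the same temp/data.csv file as in the question; the chunk size, the float32 dtype, and the iter_chunks helper name are illustrative choices, not fixed requirements:

import pandas as pd

def iter_chunks(path, chunksize=100_000):
    # read_csv with chunksize returns an iterator of DataFrames,
    # so only one chunk of rows is held in memory at a time
    for chunk in pd.read_csv(path, header=None, chunksize=chunksize, dtype='float32'):
        yield chunk.to_numpy()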
Then, instead of using fit, you should use partial_fit to train.
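Putting the two together, a rough sketch of the incremental training loop, reusing the hypothetical iter_chunks helper from the snippet above:

import numpy as np
from sklearn.cluster import MiniBatchKMeans

kmeans = MiniBatchKMeans(n_clusters=200, random_state=0)

# Each partial_fit call updates the centroids using only the chunk
# currently held in memory.
for chunk in iter_chunks('temp/data.csv'):
    kmeans.partial_fit(chunk)

# Predict (and transform, if needed) chunk by chunk as well, so the
# full 6.4M-row result is only assembled at the end.
predict = np.concatenate([kmeans.predict(chunk)
                          for chunk in iter_chunks('temp/data.csv')])

Note that this streams the CSV twice (once to train, once to predict), which is usually an acceptable trade-off when the alternative is running out of RAM.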