I'm working with a dataset of 6.4 million samples with 500 dimensions, and I'm trying to group it into 200 clusters. I'm limited to 90 GB of RAM, and when I try to run MiniBatchKMeans from sklearn.cluster, the OS kills the process for using too much memory.
This is the code:
import numpy as np
from sklearn import cluster

numClusters = 200

# The whole 6.4M x 500 array is loaded into memory at once
data = np.loadtxt('temp/data.csv', delimiter=',')
labels = np.genfromtxt('temp/labels', delimiter=',')

kmeans = cluster.MiniBatchKMeans(n_clusters=numClusters, random_state=0).fit(data)
predict = kmeans.predict(data)
Tdata = kmeans.transform(data)
It doesn't get past clustering.
The solution is to use sklearn's partial_fit method. Not all estimators have this option, but MiniBatchKMeans does.
So you can train the model incrementally, but you'll have to split your data instead of reading it all in one go. This can be done with generators, and there are many ways to do it. If you use pandas, for example, you can read the CSV in chunks with read_csv and its chunksize parameter, as in the sketch below.
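A minimal sketch of that chunked reading, assuming the same temp/data.csv file as in the question; the chunk size, the float32 dtype, and the iter_chunks helper name are illustrative choices, not fixed requirements:

import pandas as pd

def iter_chunks(path, chunksize=100_000):
    # read_csv with chunksize returns an iterator of DataFrames,
    # so only one chunk of rows is held in memory at a time
    for chunk in pd.read_csv(path, header=None, chunksize=chunksize, dtype='float32'):
        yield chunk.to_numpy()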
Then, instead of using fit, you should use partial_fit to train.
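Putting the two together, a rough sketch of the incremental training loop, reusing the hypothetical iter_chunks helper from the snippet above:

import numpy as np
from sklearn.cluster import MiniBatchKMeans

kmeans = MiniBatchKMeans(n_clusters=200, random_state=0)

# Each partial_fit call updates the centroids using only the chunk
# currently held in memory.
for chunk in iter_chunks('temp/data.csv'):
    kmeans.partial_fit(chunk)

# Predict (and transform, if needed) chunk by chunk as well, so the
# full 6.4M-row result is only assembled at the end.
predict = np.concatenate([kmeans.predict(chunk)
                          for chunk in iter_chunks('temp/data.csv')])

Note that this streams the CSV twice (once to train, once to predict), which is usually an acceptable trade-off when the alternative is running out of RAM.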