We are students working with a dataset of about 140 million records on which we want to run a few machine learning algorithms. We are new to cloud solutions and Mahout. We currently have the data in a PostgreSQL database, but that setup doesn't scale: read/write operations remain extremely slow despite extensive performance tuning. Hence we are planning to move to cloud-based services.
We have explored a few possible alternatives.
Here are our questions:
Thanks
It depends on the nature of the machine learning problem you want to solve. I would recommend first subsampling your dataset to something that fits in memory (e.g. 100k samples with a few hundred non-zero features per sample, assuming a sparse representation).
Then try a couple of machine learning algorithms that scale to a large number of samples in scikit-learn:
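As a starting point, here is a minimal sketch of that first step: fitting a linear model trained with stochastic gradient descent on an in-memory subsample. The synthetic data and the particular parameter values are placeholders; in practice you would export a random sample of your 140M rows from PostgreSQL instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# Stand-in for a ~100k-row subsample of the full dataset.
X, y = make_classification(n_samples=100_000, n_features=300,
                           n_informative=50, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Linear model trained with SGD: training time scales roughly linearly
# with the number of samples and sparse input is supported.
clf = SGDClassifier(alpha=1e-4, random_state=0)
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```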
Perform a grid search to find the optimal values of the model's hyperparameters (e.g. the regularizer alpha and the number of passes n_iter for SGDClassifier) and evaluate the performance using cross-validation.
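A hedged sketch of that grid search step follows, reusing the X_train/y_train subsample from above. The grid values are illustrative only; note that recent scikit-learn releases call the number-of-passes parameter max_iter rather than n_iter.

```python
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "alpha": [1e-5, 1e-4, 1e-3, 1e-2],  # regularization strength
    "max_iter": [5, 10, 20, 50],        # passes over the training data
}

# 5-fold cross-validated grid search over the hyperparameters.
grid = GridSearchCV(SGDClassifier(random_state=0), param_grid,
                    cv=5, scoring="accuracy", n_jobs=-1)
grid.fit(X_train, y_train)

print("best params:", grid.best_params_)
print("best CV accuracy:", grid.best_score_)
```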
Once done, retry with a 2x larger dataset (still fitting in memory) and see if it improves your predictive accuracy significantly. If it does not, don't waste your time trying to parallelize this on a cluster to run on the full dataset, as it won't yield any better results.
If it does, what you could do is shard the data into pieces, store a slice of the data on each node, learn an SGDClassifier or SGDRegressor model on each node independently with picloud, collect back the weights (coef_ and intercept_), and then compute the average weights to build the final linear model and evaluate it on some held-out slice of your dataset.
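Below is a minimal sketch of that averaging idea, run locally rather than on a cluster: split the data into shards, fit an independent SGDClassifier per shard, then average coef_ and intercept_ into one linear model. On a real cluster each shard would be trained on its own node (e.g. via picloud) and only the weights shipped back; the helper name and shard count here are illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import SGDClassifier

def average_sgd_models(X, y, n_shards=4):
    base = SGDClassifier(alpha=1e-4, random_state=0)
    coefs, intercepts = [], []

    # Train one independent model per shard of the data.
    for X_shard, y_shard in zip(np.array_split(X, n_shards),
                                np.array_split(y, n_shards)):
        model = clone(base).fit(X_shard, y_shard)
        coefs.append(model.coef_)
        intercepts.append(model.intercept_)

    # Build the final model by averaging the per-shard weights.
    averaged = clone(base)
    averaged.fit(X[:100], y[:100])  # cheap fit just to initialize the attributes
    averaged.coef_ = np.mean(coefs, axis=0)
    averaged.intercept_ = np.mean(intercepts, axis=0)
    return averaged

final_model = average_sgd_models(X_train, y_train)
print("held-out accuracy:", final_model.score(X_test, y_test))
```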
To learn more about this kind of error analysis, have a look at how to plot learning curves:
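For illustration, here is one way to plot learning curves with scikit-learn's learning_curve utility: if the cross-validation score has already plateaued, adding more samples (or parallelizing over the full 140M rows) is unlikely to buy much.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import learning_curve

# Score the model on increasingly large fractions of the training data.
train_sizes, train_scores, valid_scores = learning_curve(
    SGDClassifier(alpha=1e-4, random_state=0), X_train, y_train,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, n_jobs=-1)

plt.plot(train_sizes, train_scores.mean(axis=1), "o-", label="training score")
plt.plot(train_sizes, valid_scores.mean(axis=1), "o-", label="cross-validation score")
plt.xlabel("number of training samples")
plt.ylabel("accuracy")
plt.legend(loc="best")
plt.show()
```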