 

What is the best/preferred approach to implementing Maximum Likelihood Estimation for large (multi-GB) data sets?

I have a data set that is several gigabytes (GB) in size and want to estimate parameters for the missing values in it.

Maximum-likelihood estimation (MLE) is a standard technique in machine learning that can be used for this.
Since R might not handle such a large data set, which library would be best to use for it?

Nishu Tayal asked Dec 21 '25 16:12


1 Answer

From Wikipedia's article on MLE:

In statistics, maximum-likelihood estimation (MLE) is a method of estimating the parameters of a statistical model. When applied to a data set and given a statistical model, maximum-likelihood estimation provides estimates for the model's parameters.

Generally you need two steps before you can apply MLE:

  • obtain a dataset
  • identify a statistical model

At this point, if you can obtain an analytic solution for the MLE estimate, just stream your data into that calculation. For example, for a Gaussian distribution, to estimate the mean you only need to accumulate the running sum and the count; the sample mean is your MLE estimate.
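A minimal Python sketch of that streaming approach, assuming the data sits in a plain text file with one numeric value per line (the file name is only illustrative):

```python
# Streaming MLE of a Gaussian mean: accumulate only the running sum and count,
# so memory use stays constant no matter how large the file is.
def streaming_mean(path):
    total = 0.0
    count = 0
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            total += float(line)
            count += 1
    return total / count  # sample mean = MLE of the Gaussian mean

# Usage (hypothetical file):
# mu_hat = streaming_mean("values.txt")
```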

However, when the model involves many parameters and its pdf is highly non-linear, the MLE estimate must be sought numerically using nonlinear optimization algorithms. If your data size is huge, try stochastic gradient descent (SGD): the true gradient is approximated by the gradient at a single example, and as the algorithm sweeps through the training set it applies the update for each training example in turn. That way you can still stream your data one example at a time, over multiple sweeps, and the memory constraint should not be a problem at all.
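As an illustration, here is a small sketch of SGD applied to a log-likelihood, using logistic regression as the model; the model choice, learning rate, sweep count, and the `example_stream` helper are assumptions for the example, not part of the original answer:

```python
import math

def sgd_logistic_mle(example_stream, n_features, lr=0.01, sweeps=3):
    """Maximize the log-likelihood of a logistic regression model with SGD.

    example_stream() must yield (x, y) pairs one at a time, where x is a list
    of n_features floats and y is 0 or 1, so the full data set never has to
    fit in memory.
    """
    w = [0.0] * n_features
    for _ in range(sweeps):
        for x, y in example_stream():
            # Predicted probability p = sigmoid(w . x); clamp z to avoid overflow.
            z = sum(wi * xi for wi, xi in zip(w, x))
            z = max(-30.0, min(30.0, z))
            p = 1.0 / (1.0 + math.exp(-z))
            # Per-example gradient of the log-likelihood is (y - p) * x;
            # take a small step in that direction (gradient ascent on log-likelihood).
            for i in range(n_features):
                w[i] += lr * (y - p) * x[i]
    return w
```

Because each update touches only one example, the same function works whether the examples come from a list in memory or from a generator reading a multi-GB file line by line.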

greeness answered Dec 24 '25 09:12


