
What is bootstrapped data in data mining?

Recently I came across this term, but I really have no idea what it refers to. I've searched online, but with little gain. Thanks.

Kevin asked Sep 16 '10 09:09



People also ask

Why do we bootstrap data?

Bootstrapping is a statistical procedure that resamples a single dataset to create many simulated samples. This process allows you to calculate standard errors, construct confidence intervals, and perform hypothesis testing for numerous types of sample statistics.

What is a bootstrapped sample?

In statistics, bootstrap sampling is a method that repeatedly draws sample data with replacement from a data source to estimate a population parameter.

What is bootstrapping and its types?

Bootstrapping is any test or metric that uses random sampling with replacement (e.g. mimicking the sampling process), and falls under the broader class of resampling methods. Bootstrapping assigns measures of accuracy (bias, variance, confidence intervals, prediction error, etc.) to sample estimates.

Why is it called bootstrapping?

The name comes from the phrase "pull oneself up by one's own bootstraps": bootstrapping lets you produce estimates from a single sample, using nothing but the data you already have.


2 Answers

Take a sample of the time of day that you wake up on Saturdays. Some Friday nights you have a few too many drinks, so you wake up early (but go back to bed). Other days you wake up at a normal time. Other days you sleep in.

Here are the results:

[3.1, 4.8, 6.3, 6.4, 6.6, 7.3, 7.5, 7.7, 7.9, 10.1]

What is the mean time that you wake up?

Well it's 6.8 (o'clock, or 6:48). A touch early for me.

How good a prediction is this of when you'll wake up next Saturday? Can you quantify how wrong you are likely to be?

It's a pretty small sample, and we're not sure of the distribution of the underlying process, so it might not be a good idea to use standard parametric statistical techniques†.

Why don't we take a random sample of our sample (drawing with replacement, so the same value can be picked more than once), calculate the mean, and repeat this? The spread of those means will give us an estimate of how bad our estimate is.

I did this several times, and the mean was between 5.98 and 7.8.

This is called the bootstrap, and it was introduced by Bradley Efron in 1979.
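The resampling described above can be sketched in a few lines of Python. This is only an illustration: the function name `bootstrap_means`, the choice of 1000 resamples, and the fixed seed are my assumptions, not part of the original answer. It also prints a rough percentile interval, which is one common way to summarise the resampled means.

```python
import random
import statistics

wake_times = [3.1, 4.8, 6.3, 6.4, 6.6, 7.3, 7.5, 7.7, 7.9, 10.1]

def bootstrap_means(data, n_resamples=1000, seed=0):
    """Return the mean of each bootstrap resample of `data` (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed only so the run is repeatable
    means = []
    for _ in range(n_resamples):
        # Draw len(data) values from the sample, with replacement.
        resample = rng.choices(data, k=len(data))
        means.append(statistics.mean(resample))
    return means

means = bootstrap_means(wake_times)
ordered = sorted(means)
# Rough 2.5th and 97.5th percentiles of the resampled means.
low = ordered[int(0.025 * len(ordered))]
high = ordered[int(0.975 * len(ordered))]

print(f"sample mean: {statistics.mean(wake_times):.2f}")
print(f"bootstrap means range from {min(means):.2f} to {max(means):.2f}")
print(f"rough 95% percentile interval: {low:.2f} to {high:.2f}")
```

With many resamples the extreme means will spread further apart than in a handful of runs, which is why the percentile interval is usually more informative than the raw minimum and maximum.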

A variant is called the jackknife, where you leave one observation out of your dataset, take the mean of the rest, and repeat for each observation. The jackknife mean is 6.8 (same as the arithmetic mean), and the leave-one-out means range from 6.4 to 7.2.
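A minimal sketch of the jackknife on the same data, assuming the plain leave-one-out form; the helper name `jackknife_means` is mine:

```python
import statistics

wake_times = [3.1, 4.8, 6.3, 6.4, 6.6, 7.3, 7.5, 7.7, 7.9, 10.1]

def jackknife_means(data):
    """Return the mean of `data` with each single observation left out in turn."""
    return [
        statistics.mean(data[:i] + data[i + 1:])
        for i in range(len(data))
    ]

loo_means = jackknife_means(wake_times)
print(f"jackknife mean: {statistics.mean(loo_means):.2f}")
print(f"range: {min(loo_means):.2f} to {max(loo_means):.2f}")
```

Unlike the bootstrap there is no randomness here, so the result is the same on every run: the smallest mean comes from dropping 10.1 and the largest from dropping 3.1.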

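Another variant is called k-fold cross-validation, where you (at random) split your data set into k equally sized sections, calculate the mean of all but one section, and repeat k times, leaving out a different section each time. The 5-fold cross-validation mean is 6.8. With two values per section, the means of the retained data can range from about 6.2 to 7.5 depending on the split, while the means of the individual sections can range from about 4 to 9.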
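Here is a rough sketch of the 5-fold split on the same data. The helper name `k_fold_means` and the fixed seed are assumptions for illustration. It reports both the mean of the data kept in each round and the mean of each held-out section, since the two spread very differently on a sample this small.

```python
import random
import statistics

wake_times = [3.1, 4.8, 6.3, 6.4, 6.6, 7.3, 7.5, 7.7, 7.9, 10.1]

def k_fold_means(data, k=5, seed=0):
    """Shuffle `data`, split it into k folds, and return
    (means of the data kept in each round, means of each held-out fold)."""
    rng = random.Random(seed)  # fixed seed only so the run is repeatable
    shuffled = data[:]
    rng.shuffle(shuffled)
    # Every k-th item goes to the same fold; 10 items and k=5 give 2 per fold.
    folds = [shuffled[i::k] for i in range(k)]
    kept_means = []
    for i in range(k):
        kept = [x for j, fold in enumerate(folds) if j != i for x in fold]
        kept_means.append(statistics.mean(kept))
    fold_means = [statistics.mean(fold) for fold in folds]
    return kept_means, fold_means

kept_means, fold_means = k_fold_means(wake_times)
print(f"mean of kept-data means: {statistics.mean(kept_means):.2f}")
print(f"kept-data means: {min(kept_means):.2f} to {max(kept_means):.2f}")
print(f"held-out fold means: {min(fold_means):.2f} to {max(fold_means):.2f}")
```

The exact ranges depend on how the shuffle happens to pair the values, so a different seed gives different numbers.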

† This distribution does happen to be Normal. The 95% confidence interval of the mean is 5.43 to 8.11, reasonably close to the range of bootstrap means but wider.
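For reference, the parametric interval in the footnote can be reproduced as follows. I am assuming it is the usual Student's t interval for a mean; the critical value 2.262 (two-sided 95%, 9 degrees of freedom) is hard-coded from a standard t table because the Python standard library has no t distribution.

```python
import math
import statistics

wake_times = [3.1, 4.8, 6.3, 6.4, 6.6, 7.3, 7.5, 7.7, 7.9, 10.1]

# Two-sided 95% critical value of Student's t with 9 degrees of freedom
# (n - 1 for ten observations), taken from a standard t table.
T_CRIT_95_DF9 = 2.262

n = len(wake_times)
mean = statistics.mean(wake_times)
# statistics.stdev is the sample standard deviation (n - 1 denominator).
std_err = statistics.stdev(wake_times) / math.sqrt(n)
half_width = T_CRIT_95_DF9 * std_err
lower, upper = mean - half_width, mean + half_width

print(f"95% CI for the mean: {lower:.2f} to {upper:.2f}")
```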

Neil McGuigan answered Sep 30 '22 05:09



If you don't have enough data to train your algorithm, you can increase the size of your training set by (uniformly) randomly selecting items and duplicating them (with replacement).
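As a rough illustration of that idea, assuming the training set is a plain Python list (the function name `oversample` and the toy data are hypothetical):

```python
import random

def oversample(items, target_size, seed=0):
    """Pad `items` up to `target_size` with uniformly chosen duplicates."""
    if target_size < len(items):
        raise ValueError("target_size must be at least len(items)")
    rng = random.Random(seed)  # fixed seed only so the run is repeatable
    # Sampling with replacement: the same item may be duplicated several times.
    extra = rng.choices(items, k=target_size - len(items))
    return list(items) + extra

training_set = ["a", "b", "c", "d"]  # hypothetical toy training items
bigger = oversample(training_set, 10)
print(bigger)
```

The duplicates carry no new information; they only change how often each existing item is seen during training.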

Michael Clerx answered Sep 30 '22 04:09
