Recently I came across this term, but I really have no idea what it refers to. I've searched online, but with little gain. Thanks.

What is bootstrapped data in data mining?

2 Answers

Take a sample of the time of day that you wake up on Saturdays. Some Friday nights you have a few too many drinks, so you wake up early (but go back to bed). Other days you wake up at a normal time. Other days you sleep in.

Here are the results:

[3.1, 4.8, 6.3, 6.4, 6.6, 7.3, 7.5, 7.7, 7.9, 10.1]

What is the mean time that you wake up?

Well, it's 6.77, or about 6.8 (o'clock; roughly 6:46 a.m.). A touch early for me.

How good a prediction is this of when you'll wake up next Saturday? Can you quantify how wrong you are likely to be?

It's a pretty small sample, and we're not sure of the distribution of the underlying process, so it might not be a good idea to use standard parametric statistical techniques†.

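Why don't we take a random sample of our sample (the same size as the original, drawn with replacement, so some values repeat and others are left out), calculate the mean, and repeat this? This will give us an estimate of how bad our estimate is.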

I did this several times, and the mean was between 5.98 and 7.8.

This is called the bootstrap, and it was first mentioned by Bradley Efron in 1979.
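The resampling described above can be sketched in a few lines of Python using only the standard library. The function name `bootstrap_means`, the number of resamples, and the seed are illustrative choices, not part of the original answer, and the numbers you get will differ from run to run unless the seed is fixed.

```python
import random
import statistics

wake_times = [3.1, 4.8, 6.3, 6.4, 6.6, 7.3, 7.5, 7.7, 7.9, 10.1]

def bootstrap_means(data, n_resamples=1000, seed=0):
    """Return the mean of each bootstrap resample of `data`."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    means = []
    for _ in range(n_resamples):
        # Draw len(data) values WITH replacement from the original sample.
        resample = rng.choices(data, k=len(data))
        means.append(statistics.fmean(resample))
    return means

means = sorted(bootstrap_means(wake_times))

# Percentile interval: the middle 95% of the resampled means.
lower = means[int(0.025 * len(means))]
upper = means[int(0.975 * len(means)) - 1]

print(f"sample mean: {statistics.fmean(wake_times):.2f}")
print(f"middle 95% of bootstrap means: {lower:.2f} to {upper:.2f}")
```

The spread of the resampled means is the "estimate of how bad our estimate is"; taking percentiles of that spread is one common way to turn it into an interval.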

A variant is called the jackknife, where you leave out one observation, take the mean of the rest, and repeat for each observation in turn. The jackknife mean is 6.8 (same as the arithmetic mean) and ranges from 6.4 to 7.2.
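As a minimal sketch of the jackknife on the same data (the helper name `jackknife_means` is made up for this example), each observation is dropped once and the mean of the remaining nine is recorded:

```python
import statistics

wake_times = [3.1, 4.8, 6.3, 6.4, 6.6, 7.3, 7.5, 7.7, 7.9, 10.1]

def jackknife_means(data):
    """Return the mean of `data` with each observation left out in turn."""
    return [
        statistics.fmean(data[:i] + data[i + 1:])  # all values except data[i]
        for i in range(len(data))
    ]

loo_means = jackknife_means(wake_times)

print(f"jackknife mean: {statistics.fmean(loo_means):.2f}")
print(f"range: {min(loo_means):.2f} to {max(loo_means):.2f}")
```

Unlike the bootstrap, nothing here is random: there are exactly as many leave-one-out means as there are observations.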

Another variant is called k-fold cross-validation, where you (at random) split your data set into k equally-sized sections, calculate the mean of all but one section, and repeat k times. The 5-fold cross-validation mean is 6.8; the means of the held-out sections range from about 4 to 9, while the means of the remaining four sections stay between roughly 6.2 and 7.5.
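The k-fold splitting can be sketched as follows. This is an illustrative standard-library version, with `kfold_means` and the seed being assumptions of the example. It returns both the mean of the sections that were kept and the mean of the section that was held out, since the two behave quite differently on ten values.

```python
import random
import statistics

wake_times = [3.1, 4.8, 6.3, 6.4, 6.6, 7.3, 7.5, 7.7, 7.9, 10.1]

def kfold_means(data, k=5, seed=0):
    """Shuffle `data`, split it into k folds, and return
    (means of the k-1 kept folds, means of the held-out fold)."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    # Every k-th element goes to the same fold, giving equally sized folds
    # whenever len(data) is a multiple of k.
    folds = [shuffled[i::k] for i in range(k)]

    kept_means, held_out_means = [], []
    for i, fold in enumerate(folds):
        kept = [x for j, other in enumerate(folds) if j != i for x in other]
        kept_means.append(statistics.fmean(kept))
        held_out_means.append(statistics.fmean(fold))
    return kept_means, held_out_means

kept_means, held_out_means = kfold_means(wake_times)

print(f"mean of kept-fold means: {statistics.fmean(kept_means):.2f}")
print(f"kept-fold means: {min(kept_means):.2f} to {max(kept_means):.2f}")
print(f"held-out means:  {min(held_out_means):.2f} to {max(held_out_means):.2f}")
```

With only two values per held-out fold, the held-out means swing much more widely than the means of the eight values that are kept.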

† This distribution does happen to be Normal. The 95% confidence interval of the mean is 5.43 to 8.11, reasonably close to, but wider than, the range of bootstrap means.
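For comparison, the parametric interval in the footnote can be reproduced with a Student's t interval. This sketch assumes that is how the figure was obtained, and it hard-codes the two-sided 95% critical value for 9 degrees of freedom from a standard t-table rather than computing it.

```python
import math
import statistics

wake_times = [3.1, 4.8, 6.3, 6.4, 6.6, 7.3, 7.5, 7.7, 7.9, 10.1]

# Assumption: two-sided 95% critical value of Student's t with
# n - 1 = 9 degrees of freedom, taken from a standard t-table.
T_CRIT_95_DF9 = 2.262

mean = statistics.fmean(wake_times)
# Standard error of the mean, using the sample standard deviation (n - 1).
std_err = statistics.stdev(wake_times) / math.sqrt(len(wake_times))

lower = mean - T_CRIT_95_DF9 * std_err
upper = mean + T_CRIT_95_DF9 * std_err

print(f"95% confidence interval of the mean: {lower:.2f} to {upper:.2f}")
```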

114

answered Sep 30 '22 05:09

Neil McGuigan

If you don't have enough data to train your algorithm, you can increase the size of your training set by (uniformly) randomly selecting items and duplicating them (with replacement).
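A rough sketch of that idea, with a hypothetical `oversample` helper and a toy training set standing in for real data:

```python
import random

def oversample(items, target_size, seed=0):
    """Grow `items` to `target_size` by appending duplicates drawn
    uniformly at random, with replacement, from the original items."""
    if target_size < len(items):
        raise ValueError("target_size must be at least len(items)")
    rng = random.Random(seed)
    extra = rng.choices(items, k=target_size - len(items))
    return list(items) + extra

# Toy example: four labelled items grown to ten.
training_set = [("a", 0), ("b", 1), ("c", 0), ("d", 1)]
bigger = oversample(training_set, target_size=10)

print(len(bigger))
```

Every added item is a copy of an existing one, so the enlarged set contains no values that were not already in the original.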

answered Sep 30 '22 04:09

Michael Clerx

Tags:

machine-learning

data-mining

Kevin
