
Random sampling from a dataset, while preserving original probability distribution

I have a set of more than 2000 numbers gathered from measurement. I want to sample from this data set, about 10 times in each test, while preserving the probability distribution both overall and, as far as possible, within each test. For example, each test should contain some small values, some mid-range values, and some large values, with a mean and variance approximately equal to those of the original distribution. Across all tests combined, the mean and variance of all the samples should also be approximately equal to those of the original distribution.

Because my dataset has a long-tailed probability distribution, the amount of data in each part of the range is not the same:

Fig 1. Probability density plot of ~2k elements of data.

I am using Java. Right now I draw a random int from a uniform distribution, use it as an index into the dataset, and return the data element at that position:

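public int getRandomData() {
    int[] data = {1231, 414, 222, 4211, 41, 203, 123, 432, ...};  // measured values, in measurement order
    int length = data.length;
    Random r = new Random();            // java.util.Random
    int randomInt = r.nextInt(length);  // uniform index in [0, length)
    return data[randomInt];
}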

I don't know whether this works as I want, because I use the data in the order in which it was measured, and that sequence has a great deal of serial correlation.

Asked by Ho1 on Sep 12 '15 at 14:09




1 Answer

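It works as you want. The order of the data is irrelevant: each call picks an index uniformly at random, independently of earlier calls, so every element is equally likely to be returned regardless of its position in the array.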
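One way to check this claim empirically is to pool many 10-draw tests and compare the pooled mean and variance with those of the data. The following is only a minimal sketch: the class name `SamplingCheck`, its helper methods, and the eight-element array (the visible values from the question, standing in for the full dataset) are illustrative assumptions, not part of the original code.

```java
import java.util.Random;

class SamplingCheck {
    // Draw n elements uniformly at random, with replacement, from data.
    static int[] sample(int[] data, int n, Random rng) {
        int[] out = new int[n];
        for (int i = 0; i < n; i++) {
            out[i] = data[rng.nextInt(data.length)];
        }
        return out;
    }

    // Arithmetic mean of the values.
    static double mean(int[] xs) {
        double sum = 0;
        for (int x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    // Population variance (divides by the number of values).
    static double variance(int[] xs) {
        double m = mean(xs);
        double ss = 0;
        for (int x : xs) {
            ss += (x - m) * (x - m);
        }
        return ss / xs.length;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the measured values (assumption).
        int[] data = {1231, 414, 222, 4211, 41, 203, 123, 432};
        Random rng = new Random();

        // 10,000 tests of 10 draws each, pooled into one array.
        int[] pooled = sample(data, 10 * 10_000, rng);

        System.out.printf("data:   mean=%.1f var=%.1f%n", mean(data), variance(data));
        System.out.printf("pooled: mean=%.1f var=%.1f%n", mean(pooled), variance(pooled));
    }
}
```

With this many draws the pooled statistics should land close to those of the data, whatever order the array is in. A single 10-draw test will vary considerably more, which is expected for a long-tailed distribution.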

Answered by Rex D on Oct 04 '22 at 04:10