I am trying to generate a random categorical column from an existing column in order to create synthetic data. For example, if my column has three values (0, 1, 2), with 0 appearing 50% of the time and 1 and 2 appearing 30% and 20% of the time, I want my new random column to have similar (but not identical) proportions.
There is a similar question on Cross Validated that was solved using R: https://stats.stackexchange.com/questions/14158/how-to-generate-random-categorical-data. However, I would like a Python solution.
A categorical distribution is a discrete probability distribution whose sample space is the set of k individually identified items. It is the generalization of the Bernoulli distribution for a categorical random variable.
Let P(X) be the probability that a random number generated according to your distribution is less than X. You start by generating a uniform random X between zero and one. After that, you find Y such that P(Y) = X and output Y. You can find such a Y using binary search (since P(X) is an increasing function of X).
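The inverse-CDF idea described above can be sketched for a categorical variable as follows. This is a minimal illustration, not a reference implementation: the helper name `sample_categorical` and its parameters are hypothetical, and the binary search is done with the standard library's `bisect` over the cumulative probabilities.

```python
import bisect
import itertools
import random


def sample_categorical(values, probs, size, rng):
    """Draw `size` values by inverting the cumulative distribution.

    `sample_categorical` is a hypothetical helper written for this sketch.
    """
    # Running totals of the probabilities, e.g. [0.5, 0.8, 1.0] up to rounding.
    cumulative = list(itertools.accumulate(probs))
    total = cumulative[-1]
    draws = []
    for _ in range(size):
        u = rng.random() * total  # uniform number in [0, total)
        # Binary search for the first cumulative value strictly greater than u.
        index = bisect.bisect_right(cumulative, u)
        # Guard against floating-point edge cases at the upper end.
        draws.append(values[min(index, len(values) - 1)])
    return draws


rng = random.Random(444)
draws = sample_categorical([0, 1, 2], [0.5, 0.3, 0.2], 20_000, rng)
proportions = [draws.count(v) / len(draws) for v in (0, 1, 2)]
print(proportions)
```

Each draw costs one binary search over k cumulative values, so the approach stays cheap even when the number of categories is large.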
The sample function is used to generate a random sample from a given population.
First, a prominent disclaimer is necessary. Most random data generated with Python is not fully random in the scientific sense of the word. Rather, it is pseudorandom: generated with a pseudorandom number generator (PRNG), which is essentially any algorithm for generating seemingly random but still reproducible data. “True”...
Probability distributions help model random phenomena, enabling us to obtain estimates of the probability that a certain event may occur. In this article, we’ll implement and visualize some of the commonly used probability distributions using Python
Probability Distributions with Python (Implemented Examples) Probability Distributions are mathematical functions that describe all the possible values and likelihoods that a random variable can take within a given range.
If you are using a Python version older than 3.6, you can use the NumPy library to make weighted random choices. Install it with pip install numpy. With numpy.random.choice() you can specify the probability distribution.
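On Python 3.6 and later, the standard library's `random.choices` accepts a `weights` argument, so NumPy is not strictly required. The following is a small sketch using the proportions from the question; the seed and the sample size are arbitrary choices made for illustration.

```python
import random
from collections import Counter

# A dedicated generator with an arbitrary seed, so the run is reproducible.
rng = random.Random(444)

# Draw 10,000 values with replacement, weighted 50% / 30% / 20%.
draws = rng.choices([0, 1, 2], weights=[0.5, 0.3, 0.2], k=10_000)

# Empirical proportions of each category in the sample.
counts = Counter(draws)
proportions = {value: counts[value] / len(draws) for value in sorted(counts)}
print(proportions)
```

The weights are relative, so they do not have to sum to 1; raw category counts would work equally well.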
Use np.random.choice() and specify a vector of probabilities corresponding to the chosen-from array:
>>> import numpy as np
>>> np.random.seed(444)
>>> data = np.random.choice(
...     a=[0, 1, 2],
...     size=50,
...     p=[0.5, 0.3, 0.2]
... )
>>> data
array([2, 2, 1, 1, 0, 0, 0, 0, 0, 0, 2, 2, 0, 1, 0, 0, 0, 0, 2, 1, 0, 1,
       1, 1, 0, 2, 1, 1, 2, 1, 1, 0, 0, 0, 0, 2, 0, 1, 0, 2, 0, 2, 2, 2,
       1, 1, 1, 0, 0, 1])
>>> np.bincount(data) / len(data) # Proportions
array([0.44, 0.32, 0.24])
As your sample size increases, the empirical frequencies should converge towards your targets:
>>> a_lot_of_data = np.random.choice(
...     a=[0, 1, 2],
...     size=500_000,
...     p=[0.5, 0.3, 0.2]
... )
>>> np.bincount(a_lot_of_data) / len(a_lot_of_data)
array([0.499716, 0.299602, 0.200682])
As noted by @WarrenWeckesser, if you already have the 1-D NumPy array or pandas Series, you can use that directly as the input without specifying p. The default of np.random.choice() is to sample with replacement (replace=True), so by passing your original data, the resulting distribution should approximate that of the input.
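Resampling the existing column directly might look like the sketch below. The `original` array is made-up stand-in data for the question's column, and the example assumes NumPy 1.17 or later for `np.random.default_rng`; with the legacy API, `np.random.choice(original, size=...)` would be used the same way.

```python
import numpy as np

# Assumes NumPy >= 1.17; the seed is arbitrary.
rng = np.random.default_rng(444)

# Hypothetical existing column: 50% zeros, 30% ones, 20% twos.
original = np.repeat([0, 1, 2], [500, 300, 200])

# Sample with replacement straight from the data; no p vector is needed
# because each value is drawn in proportion to how often it appears.
synthetic = rng.choice(original, size=original.size, replace=True)

# Compare the empirical proportions of the two columns.
values, counts = np.unique(original, return_counts=True)
original_props = counts / original.size
synthetic_props = np.bincount(synthetic, minlength=3) / synthetic.size
print(original_props, synthetic_props)
```

If explicit probabilities are preferred, `original_props` can be passed as `p` together with `values` as the array to choose from, which gives the same distribution.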