
How to generate random categorical data in python according to a probability distribution? [closed]

I am trying to generate a random column of a categorical variable from an existing column, to create some synthesized data. For example, if my column has three values (0, 1, and 2), with 0 appearing 50% of the time and 1 and 2 appearing 30% and 20% of the time, I want my new random column to have similar (but not identical) proportions.

There is a similar question on Cross Validated that was solved using R: https://stats.stackexchange.com/questions/14158/how-to-generate-random-categorical-data. However, I would like a Python solution.

Dwarkesh23 asked Aug 09 '19 18:08




1 Answer

Use np.random.choice() and specify a vector of probabilities, one per element of the array being sampled from:

>>> import numpy as np
>>> np.random.seed(444)
>>> data = np.random.choice(
...     a=[0, 1, 2],
...     size=50,
...     p=[0.5, 0.3, 0.2]
... )
>>> data
array([2, 2, 1, 1, 0, 0, 0, 0, 0, 0, 2, 2, 0, 1, 0, 0, 0, 0, 2, 1, 0, 1,
       1, 1, 0, 2, 1, 1, 2, 1, 1, 0, 0, 0, 0, 2, 0, 1, 0, 2, 0, 2, 2, 2,
       1, 1, 1, 0, 0, 1])
>>> np.bincount(data) / len(data)  # Proportions
array([0.44, 0.32, 0.24])

As your sample size increases, the empirical frequencies should converge towards your targets:

>>> a_lot_of_data = np.random.choice(
...     a=[0, 1, 2],
...     size=500_000,
...     p=[0.5, 0.3, 0.2]
... )
>>> np.bincount(a_lot_of_data) / len(a_lot_of_data)
array([0.499716, 0.299602, 0.200682])
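
If you don't know the target probabilities up front, one way to get them is to estimate them from the existing column. The sketch below is only an illustration: `existing` is a made-up stand-in for your real column, and the seed and sample size are arbitrary.

```python
import numpy as np

np.random.seed(444)  # seeded only so the sketch is reproducible

# Hypothetical existing column; stands in for the real data.
existing = np.array([0] * 50 + [1] * 30 + [2] * 20)

# Distinct categories and how often each one occurs.
categories, counts = np.unique(existing, return_counts=True)
p = counts / counts.sum()  # empirical probabilities, same order as `categories`

# Draw a new column using the estimated probabilities.
synthetic = np.random.choice(categories, size=10_000, p=p)

# Share of each category in the synthetic column, in the order of `categories`.
synthetic_props = (synthetic[:, None] == categories).mean(axis=0)
print(dict(zip(categories.tolist(), synthetic_props.tolist())))
```

Comparing against `categories` instead of calling np.bincount() keeps the proportion check working when the categories are strings or other non-integer labels.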

As noted by @WarrenWeckesser, if you already have the data as a 1-D NumPy array or pandas Series, you can pass it directly as the input without specifying p. By default, np.random.choice() samples with replacement (replace=True), so when you pass your original data, the resulting distribution should approximate that of the input.
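
A minimal sketch of that approach, assuming a small made-up array named `original` in place of your real column (the labels, seed, and size here are illustrative only):

```python
import numpy as np

np.random.seed(444)  # seeded only so the sketch is reproducible

# Hypothetical original column with a 50/30/20 split across three labels.
original = np.array(["a"] * 50 + ["b"] * 30 + ["c"] * 20)

# No p given: each element of `original` is equally likely to be drawn,
# so each category is drawn in proportion to how often it appears.
resampled = np.random.choice(original, size=10_000)  # replace=True is the default

# Proportion of each category in the resampled column.
values, counts = np.unique(resampled, return_counts=True)
resampled_props = dict(zip(values.tolist(), (counts / counts.sum()).tolist()))
print(resampled_props)
```

This is effectively a bootstrap resample of the column, so it needs no separate step to compute the probabilities.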

Brad Solomon answered Nov 09 '22 20:11