How to generate random categorical data in python according to a probability distribution? [closed]

Tags: random, python-3.x, pandas, numpy

I am trying to generate a random column of a categorical variable from an existing column in order to create some synthesized data. For example, if my column has three values, 0, 1, and 2, with 0 appearing 50% of the time and 1 and 2 appearing 30% and 20% of the time, I want my new random column to have similar (but not identical) proportions.

There is a similar question on Cross Validated that was solved using R: https://stats.stackexchange.com/questions/14158/how-to-generate-random-categorical-data. However, I would like a Python solution.

asked Aug 09 '19 18:08

Dwarkesh23

1 Answer

Use np.random.choice() and specify a vector of probabilities corresponding to the array being sampled from:

>>> import numpy as np
>>> np.random.seed(444)
>>> data = np.random.choice(
...     a=[0, 1, 2],
...     size=50,
...     p=[0.5, 0.3, 0.2]
... )
>>> data
array([2, 2, 1, 1, 0, 0, 0, 0, 0, 0, 2, 2, 0, 1, 0, 0, 0, 0, 2, 1, 0, 1,
       1, 1, 0, 2, 1, 1, 2, 1, 1, 0, 0, 0, 0, 2, 0, 1, 0, 2, 0, 2, 2, 2,
       1, 1, 1, 0, 0, 1])
>>> np.bincount(data) / len(data)  # Proportions
array([0.44, 0.32, 0.24])

As your sample size increases, the empirical frequencies should converge towards your targets:

>>> a_lot_of_data = np.random.choice(
...     a=[0, 1, 2],
...     size=500_000,
...     p=[0.5, 0.3, 0.2]
... )
>>> np.bincount(a_lot_of_data) / len(a_lot_of_data)
array([0.499716, 0.299602, 0.200682])
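The same sampling can also be written with NumPy's newer Generator interface. The following is only a sketch, assuming NumPy 1.17 or later (where np.random.default_rng() is available); the seed value and the variable names are arbitrary choices for illustration, not part of the original answer.

```python
import numpy as np

# Assumption: NumPy >= 1.17, which provides np.random.default_rng().
rng = np.random.default_rng(seed=444)  # seed value is arbitrary

categories = [0, 1, 2]
probs = [0.5, 0.3, 0.2]

# Draw 500,000 values with the requested probabilities
sample = rng.choice(categories, size=500_000, p=probs)

# Empirical proportion of each category in the sample
proportions = np.bincount(sample, minlength=len(categories)) / len(sample)
print(proportions)
```

A Generator object keeps its own state, so seeding it does not affect other code that relies on the global np.random state.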

As noted by @WarrenWeckesser, if you already have the 1-D NumPy array or pandas Series, you can pass it directly as the input without specifying p. By default, np.random.choice() samples with replacement (replace=True), so when you pass your original data, the resulting distribution should approximate that of the input.
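As a rough illustration of resampling an existing column, the sketch below builds a hypothetical pandas Series (the name `original` and its contents are made up for this example) and generates a synthetic column in two ways: by resampling the observed values directly, and by first estimating the proportions with value_counts(normalize=True) and passing them as p.

```python
import numpy as np
import pandas as pd

np.random.seed(444)  # arbitrary seed, only for reproducibility

# Hypothetical existing column with 50% / 30% / 20% proportions
original = pd.Series([0] * 500 + [1] * 300 + [2] * 200, name="category")

# Option 1: resample the observed values directly (replace=True is the default)
synthetic_direct = np.random.choice(original.to_numpy(), size=len(original))

# Option 2: estimate the probabilities from the column, then sample from them
freqs = original.value_counts(normalize=True).sort_index()
synthetic_from_p = np.random.choice(
    freqs.index.to_numpy(),
    size=len(original),
    p=freqs.to_numpy(),
)

# Compare the empirical proportions of both synthetic columns
print(pd.Series(synthetic_direct).value_counts(normalize=True).sort_index())
print(pd.Series(synthetic_from_p).value_counts(normalize=True).sort_index())
```

Both options should give proportions close to, but not exactly equal to, those of the original column. The second one is mainly useful if you want to store or adjust the probabilities before sampling.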

answered Nov 09 '22 20:11

Brad Solomon
