Maximum Limit of distinct fake data using Python Faker package

1 Answer

I had this same issue and looked more into it.

In the en_US provider there are about 1000 last names and 750 first names, for about 750000 unique combinations. If you randomly select a first and last name, there is a chance you'll get duplicates. But that's how the real world works: there are many John Smiths and Robert Doyles out there.

The en profile has 7203 first names and 473 last names, which can help somewhat. Faker combines a first name with a last name, so there are about 7203 * 473 = 3407019 possible combinations.

But still, there is a chance you'll get duplicates.
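To put a rough number on that chance, here is a minimal sketch of the birthday-problem math. It assumes every combination is equally likely, which is my simplification and not something I checked against Faker's internals; if some names are weighted to come up more often, duplicates will show up sooner than this estimate. The function names are my own.

```python
import math


def prob_any_duplicate(pool_size: int, draws: int) -> float:
    """Probability that at least two of `draws` uniform picks coincide."""
    if draws > pool_size:
        return 1.0  # pigeonhole: more draws than distinct values
    # Sum logs of P(i-th pick is new) to avoid underflow for large inputs.
    log_p_all_unique = sum(math.log1p(-i / pool_size) for i in range(draws))
    return 1.0 - math.exp(log_p_all_unique)


def expected_distinct(pool_size: int, draws: int) -> float:
    """Expected number of distinct values after `draws` uniform picks."""
    return pool_size * (1.0 - (1.0 - 1.0 / pool_size) ** draws)


if __name__ == "__main__":
    pool = 7203 * 473  # combination count quoted above
    for n in (1_000, 10_000, 1_000_000):
        print(n, prob_any_duplicate(pool, n), expected_distinct(pool, n))
```

Even with a pool of a few million combinations, a duplicate becomes likely after only a few thousand draws, which is why random selection alone won't give you a fully distinct dataset.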

I solve this problem by adding numbers to names.
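For reference, the "add numbers" approach can be sketched like this. It is only an illustration, not Faker functionality: `make_unique` is a hypothetical helper, and it assumes the generated names never end in a number themselves, otherwise a suffixed name could collide with an original one.

```python
from collections import Counter


def make_unique(names):
    """Append a running number to every repeat of a name.

    Hypothetical helper: the first occurrence is kept as-is, later
    occurrences become "<name> 2", "<name> 3", and so on.
    """
    seen = Counter()
    result = []
    for name in names:
        seen[name] += 1
        if seen[name] == 1:
            result.append(name)
        else:
            result.append(f"{name} {seen[name]}")
    return result


if __name__ == "__main__":
    print(make_unique(["John Smith", "Robert Doyle", "John Smith"]))
```

This works for any name source, so you can feed it the output of whatever generator you already use.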

I need to generate a huge dataset.

Keep in mind that in reality, any huge dataset of names will have duplicates. I work with large datasets (> 1 million names) and we see a ton of duplicate first and last names.

If you read the faker package code, you can probably figure out how to modify it so you get all 3M distinct names.
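A rough sketch of one way to do that without touching Faker itself: take the first-name and last-name lists, treat every (first, last) pair as an index, and sample indices without replacement. The small lists below are stand-ins; pulling the real lists out of the provider is left to you, since I'm not going to guess at the exact attribute layout. The function name is hypothetical, and it assumes neither list contains repeated entries.

```python
import random


def distinct_full_names(first_names, last_names, count, seed=None):
    """Return `count` distinct "First Last" strings, in random order.

    Assumes first_names and last_names are sequences without repeats.
    """
    total = len(first_names) * len(last_names)
    if count > total:
        raise ValueError(f"only {total} distinct combinations available")
    rng = random.Random(seed)
    # Sampling from range() avoids building all combinations in memory.
    picks = rng.sample(range(total), count)
    n_last = len(last_names)
    # Each index maps to exactly one (first, last) pair via division and remainder.
    return [f"{first_names[i // n_last]} {last_names[i % n_last]}" for i in picks]


if __name__ == "__main__":
    firsts = ["John", "Robert", "Mary"]  # stand-in data, not Faker's lists
    lasts = ["Smith", "Doyle"]
    print(distinct_full_names(firsts, lasts, 6, seed=0))
```

Asking for more names than there are combinations raises an error instead of silently repeating, so you find out right away when the pool is too small.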

answered Oct 01 '22 22:10

Mike


Tags:

python

faker

Neron Joseph
