
Maximum Limit of distinct fake data using Python Faker package

Tags: python, faker

I have been using the Python Faker package to generate fake data. What is the maximum number of distinct values (e.g. fake names) that a method such as fake.name() can produce?

I generated 100,000 fake names and got fewer than 76,000 distinct ones. I need to know the upper limit so that I can judge how far this package will scale for generating data.

I need to generate a huge dataset. I would also like to know whether the PHP and Perl versions of Faker are the same in this respect.

Suggestions for other packages that can generate huge datasets would be highly appreciated.

Neron Joseph asked Nov 15 '17 04:11



1 Answer

I had this same issue and looked more into it.

In the en_US provider there are about 1,000 last names and 750 first names, for about 750,000 unique combinations. If you randomly select a first and a last name, there is a chance you'll get duplicates. But that is how the real world works: there are many John Smiths and Robert Doyles out there.

There are 7203 first names and 473 last names in the en profile, which helps somewhat. Faker combines a first name with a last name, so there are about 7203 * 473 = 3,407,019 possible combinations.

But still, there is a chance you'll get duplicates.
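To get a feel for how often duplicates appear, here is a minimal sketch of the expected number of distinct values after a given number of random draws. It assumes every first/last combination is equally likely and drawn independently, which may not match what Faker really does (a provider may weight names or mix several name formats). The function name `expected_distinct` is my own, not part of Faker.

```python
def expected_distinct(pool_size: int, draws: int) -> float:
    """Expected count of distinct values after `draws` uniform draws
    with replacement from `pool_size` equally likely values."""
    # Probability that one particular value is never drawn.
    p_never_drawn = (1.0 - 1.0 / pool_size) ** draws
    return pool_size * (1.0 - p_never_drawn)


# Pool sizes taken from the counts quoted in this answer.
for pool in (750_000, 7203 * 473):
    print(pool, round(expected_distinct(pool, 100_000)))
```

Under that uniform assumption, 100,000 draws from 750,000 combinations would give about 93,600 distinct names, so duplicates show up long before the pool is exhausted.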

I solve this problem by appending numbers to the names.
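A rough sketch of that idea follows. The helper `unique_names` and the stand-in generator `sample_name` are hypothetical names used for illustration; in practice you would pass something like `fake.name` as `make_name`. It assumes the generated names never already end in a number.

```python
import random
from collections import Counter
from typing import Callable, List


def unique_names(make_name: Callable[[], str], count: int) -> List[str]:
    """Return `count` distinct strings by numbering repeated names."""
    seen = Counter()
    result = []
    for _ in range(count):
        name = make_name()
        seen[name] += 1
        # The first occurrence is kept as is; later ones get " 2", " 3", ...
        if seen[name] == 1:
            result.append(name)
        else:
            result.append(f"{name} {seen[name]}")
    return result


# Stand-in for a Faker call, with a tiny pool so that repeats are certain.
FIRST = ["Ann", "Bob"]
LAST = ["Lee", "Kim"]
rng = random.Random(0)


def sample_name() -> str:
    return f"{rng.choice(FIRST)} {rng.choice(LAST)}"


names = unique_names(sample_name, 20)
print(names)
```

Newer Faker releases also offer a `unique` proxy (for example `fake.unique.name()`), but it raises an error once the pool runs out, so it does not raise the ceiling by itself.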

Regarding "I need to generate huge dataset":

Keep in mind that in reality, any huge dataset of names will have duplicates. I work with large datasets (> 1 million names) and we see a ton of duplicate first and last names.

If you read the faker package code, you can probably figure out how to modify it so you get all 3M distinct names.
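One possible way to do that, as a sketch only: once you have pulled the first-name and last-name lists out of the provider (how to do so depends on the Faker version, so it is not shown here), you can enumerate every combination instead of sampling at random. The function `all_full_names` and the sample lists are placeholders of my own.

```python
import itertools
import random
from typing import Iterable, List, Optional


def all_full_names(first_names: Iterable[str],
                   last_names: Iterable[str],
                   seed: Optional[int] = None) -> List[str]:
    """Build every first/last combination exactly once, optionally shuffled."""
    names = [f"{first} {last}"
             for first, last in itertools.product(first_names, last_names)]
    if seed is not None:
        # Shuffle so that the output does not look sorted by first name.
        random.Random(seed).shuffle(names)
    return names


# Placeholder lists; substitute the lists taken from the provider.
firsts = ["John", "Robert", "Lucy"]
lasts = ["Smith", "Doyle"]
combos = all_full_names(firsts, lasts, seed=42)
print(len(combos), combos)
```

With the full lists this holds a few million short strings in memory, which is usually manageable; a generator over `itertools.product` would avoid even that if the order does not matter.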

Mike answered Oct 01 '22 22:10