Using Python Faker to generate different data for 5000 rows

I would like to use the Python Faker library to generate 500 lines of data, but the code below produces the same data in every row. Can you please point out where I'm going wrong? I believe it has something to do with the for loop. Thanks in advance:

from faker import Factory
import pandas as pd
import random

def create_fake_stuff(fake):
    df = pd.DataFrame(columns=('name'
        , 'email'
        , 'bs'
        , 'address'
        , 'city'
        , 'state'
        , 'date_time'
        , 'paragraph'
        , 'Conrad'
        , 'randomdata'))

    stuff = [fake.name()
        , fake.email()
        , fake.bs()
        , fake.address()
        , fake.city()
        , fake.state()
        , fake.date_time()
        , fake.paragraph()
        , fake.catch_phrase()
        , random.randint(1000,2000)]

    for i in range(10):
        df.loc[i] = [item for item in stuff]
    print(df)

if __name__ == '__main__':
    fake = Factory.create()
    create_fake_stuff(fake)

asked Aug 08 '17 17:08

Conrad Addo

3 Answers

Disclaimer: this answer was added long after the question and adds new information that does not directly answer it.

There is now a fast newer library, Mimesis - Fake Data Generator.

Upside: it is stated to work several times faster than Faker (see my test below on data similar to that in the question).
Downside: it works only with Python 3.6 and later.

pip install mimesis

>>> from mimesis import Person
>>> from mimesis.enums import Gender
>>> person = Person('en')

>>> person.full_name(gender=Gender.FEMALE)
'Antonetta Garrison'
>>> personru = Person('ru')
>>> personru.full_name()
'Рената Черкасова'

The same with the older Faker library:

pip install faker

>>> from faker import Faker
>>> fake_ru = Faker('ru_RU')
>>> fake_jp = Faker('ja_JP')
>>> print (fake_ru.name())
Субботина Елена Наумовна
>>> print (fake_jp.name())
大垣 花子

Below is my recent timing of Mimesis vs. Faker, based on the code provided in the answer from forzer0eight:

from faker import Faker
import pandas as pd
import random
fake = Faker()
def create_rows_faker(num=1):
    output = [{"name":fake.name(),
               "address":fake.address(),
               "name":fake.name(),  # duplicate key: this later value is the one kept
               "email":fake.email(),
               #"bs":fake.bs(),
               "city":fake.city(),
               "state":fake.state(),
               "date_time":fake.date_time(),
               #"paragraph":fake.paragraph(),
               #"Conrad":fake.catch_phrase(),
               "randomdata":random.randint(1000,2000)} for x in range(num)]
    return output

%%time
df_faker = pd.DataFrame(create_rows_faker(5000))

CPU times: user 3.51 s, sys: 2.86 ms, total: 3.51 s
Wall time: 3.51 s

from mimesis import Person
from mimesis import Address
from mimesis import Datetime
from mimesis.enums import Gender
import pandas as pd
import random
person = Person('en')
address = Address()
datetime = Datetime()
def create_rows_mimesis(num=1):
    output = [{"name":person.full_name(gender=Gender.FEMALE),
               "address":address.address(),
               "name":person.name(),  # duplicate key: this value overrides full_name() above
               "email":person.email(),
               #"bs":person.bs(),
               "city":address.city(),
               "state":address.state(),
               "date_time":datetime.datetime(),
               #"paragraph":person.paragraph(),
               #"Conrad":person.catch_phrase(),
               "randomdata":random.randint(1000,2000)} for x in range(num)]
    return output

%%time
df_mimesis = pd.DataFrame(create_rows_mimesis(5000))

CPU times: user 178 ms, sys: 1.7 ms, total: 180 ms
Wall time: 179 ms
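
Note that %%time is a Jupyter cell magic, so the timings above only run as written inside a notebook cell. Outside a notebook, a rough equivalent can be sketched with time.perf_counter from the standard library. The helper name time_call and the stand-in workload make_rows are hypothetical, chosen only so the sketch runs without Faker or Mimesis installed:

```python
import random
import time


def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def make_rows(num=1):
    """Hypothetical stand-in for create_rows_faker / create_rows_mimesis."""
    return [{"randomdata": random.randint(1000, 2000)} for _ in range(num)]


rows, seconds = time_call(make_rows, 5000)
print(f"{len(rows)} rows in {seconds:.4f} s")
```

With the functions from this answer, the call would look like time_call(create_rows_faker, 5000). A single run is noisy, so repeating the measurement a few times gives a fairer comparison.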

Below is the resulting data for comparison:

df_faker.head(2)

   address                                      city         date_time            email              name            randomdata  state
0  3818 Goodwin Haven\nBrocktown, GA 06168      Valdezport   2004-10-18 20:35:52  [email protected]  Deborah Garcia  1218        Oklahoma
1  2568 Gonzales Field\nRichardhaven, NC 79149  West Rachel  1985-02-03 00:33:00  [email protected]  Barbara Pineda  1536        Tennessee

df_mimesis.head(2)

   address             city         date_time                   email              name      randomdata  state
0  351 Nobles Viaduct  Cedar Falls  2013-08-22 08:20:25.288883  [email protected]  Ernest    1673        Georgia
1  517 Williams Hill   Malden       2008-01-26 18:12:01.654995  [email protected]  Jonathan  1845        North Dakota

answered Sep 29 '22 09:09

Alexei Martianov

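The following script markedly improves performance: it builds all rows first as a list of dicts and creates the DataFrame in a single call, instead of adding rows one at a time with df.loc.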

    from faker import Faker
    import pandas as pd
    import random
    fake = Faker()
    def create_rows(num=1):
        output = [{"name":fake.name(),
                   "address":fake.address(),
                   "name":fake.name(),
                   "email":fake.email(),
                   "bs":fake.bs(),
                   "address":fake.address(),
                   "city":fake.city(),
                   "state":fake.state(),
                   "date_time":fake.date_time(),
                   "paragraph":fake.paragraph(),
                   "Conrad":fake.catch_phrase(),
                   "randomdata":random.randint(1000,2000)} for x in range(num)]
        return output

Generating 5000 rows this way takes 5.55 s:

    %%time
    df = pd.DataFrame(create_rows(5000))

    Wall time: 5.55 s
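
Calling the generator once per row makes the rows differ, but the values are still drawn at random, so a column such as name can repeat by chance across 5000 rows. A minimal sketch for checking that, using only the standard library; repeated_values and the sample rows are hypothetical and not part of Faker or pandas:

```python
from collections import Counter


def repeated_values(rows, column):
    """Return {value: count} for values that occur more than once in column."""
    counts = Counter(row[column] for row in rows)
    return {value: n for value, n in counts.items() if n > 1}


# Hypothetical sample rows, shaped like the list of dicts from create_rows().
sample = [
    {"name": "Ann Lee", "email": "ann@example.com"},
    {"name": "Bob Ray", "email": "bob@example.com"},
    {"name": "Ann Lee", "email": "ann2@example.com"},
]

print(repeated_values(sample, "name"))   # {'Ann Lee': 2}
print(repeated_values(sample, "email"))  # {}
```

The same function can be applied to the list returned by create_rows(5000) before it is passed to pd.DataFrame.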

answered Sep 29 '22 09:09

huang06

I moved the creation of the fake data list inside my for loop to achieve the desired result, so the Faker calls run again for every row:

for i in range(10):
    # Build a fresh list on each iteration so every row gets new values.
    stuff = [fake.name()
        , fake.email()
        , fake.bs()
        , fake.address()
        , fake.city()
        , fake.state()
        , fake.date_time()
        , fake.paragraph()
        , fake.catch_phrase()
        , random.randint(1000, 2000)]
    df.loc[i] = stuff
# Print once, after all rows have been added.
print(df)
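
The reason the original code repeats is that the calls inside the list literal run once, when stuff is created; the loop then copies the same values into every row. A minimal sketch of the difference, using the standard random module as a stand-in for Faker so it runs without extra packages (fake_name is a hypothetical helper, not a Faker function):

```python
import random


def fake_name(rng):
    """Hypothetical stand-in for fake.name(): returns a random label."""
    return f"user{rng.randint(0, 10**9)}"


rng = random.Random(0)  # seeded so the run is repeatable

# Buggy pattern: fake_name() runs once, when the list is built,
# and the loop only copies those same values.
stuff = [fake_name(rng), rng.randint(1000, 2000)]
repeated_rows = [list(stuff) for _ in range(3)]

# Fixed pattern: the calls run again on every iteration.
distinct_rows = [[fake_name(rng), rng.randint(1000, 2000)] for _ in range(3)]

print(repeated_rows)
print(distinct_rows)
```

The first list holds three identical rows, while the second holds rows built from separate calls, which is the effect of moving stuff inside the loop.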

answered Sep 29 '22 09:09

Conrad Addo

Tags: random, python-3.x, pandas, dataframe, faker
