
Bulk insert with multiprocessing using peewee

I'm working on a simple HTML scraper in Python 3.4, using peewee as the ORM (great ORM, btw!). My script takes a bunch of sites, extracts the necessary data and saves it to the database. To improve performance, every site is scraped in a separate process, and the saved data should be unique. Duplicates can appear not only across sites but also within a single site, so I want to store each item only once.
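To illustrate what I mean by separate processes, the setup is roughly along these lines (heavily simplified; names are just placeholders):

    from multiprocessing import Pool

    SITE_URLS = ['http://example.com/a', 'http://example.com/b']  # placeholder list

    def scrape_site(url):
        # placeholder for the real scraping and saving logic
        print('scraping', url)

    if __name__ == '__main__':
        # every site is handled by its own worker process
        with Pool(processes=4) as pool:
            pool.map(scrape_site, SITE_URLS)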

Example: Post and Category, a many-to-many relation. During scraping, the same category appears multiple times in different posts. The first time it shows up, I want to save that category to the database (create a new row). If the same category appears in another post, I want to bind that post to the row already created in the database.
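Simplified, the models look something like this (the field names are just examples):

    from peewee import CharField, ForeignKeyField, Model, MySQLDatabase, TextField

    db = MySQLDatabase('scraper')  # placeholder connection settings

    class BaseModel(Model):
        class Meta:
            database = db

    class Post(BaseModel):
        url = CharField()
        title = TextField()

    class Category(BaseModel):
        name = CharField()

    class PostCategory(BaseModel):
        # link table for the many-to-many relation
        post = ForeignKeyField(Post)
        category = ForeignKeyField(Category)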

My question is: do I have to use atomic updates/inserts (insert one post, save, get_or_create the categories, save, insert new rows into the many-to-many table, save), or can I use a bulk insert somehow? What is the fastest solution to this problem? Maybe some temporary tables shared between the processes, bulk-inserted into the real tables at the end of the work? I'm using a MySQL database.

Thanks for your answers and your time.


Paweł Stysz


1 Answer

You can rely on the database to enforce uniqueness by adding unique=True to a field or by declaring multi-column unique indexes. Also check the docs on get_or_create and bulk inserts (a rough sketch tying these together follows the list):

  • http://docs.peewee-orm.com/en/latest/peewee/models.html#indexes-and-unique-constraints
  • http://docs.peewee-orm.com/en/latest/peewee/querying.html#get-or-create
  • http://docs.peewee-orm.com/en/latest/peewee/querying.html#bulk-inserts
  • http://docs.peewee-orm.com/en/latest/peewee/querying.html#upsert - upsert with on conflict
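A rough sketch of how these pieces could fit together for your Post/Category case (the model and field names just mirror your description, the connection settings are placeholders, and on_conflict_ignore() is peewee 3.x syntax, which turns into INSERT IGNORE on MySQL):

    from peewee import CharField, ForeignKeyField, Model, MySQLDatabase

    db = MySQLDatabase('scraper')  # placeholder connection settings

    class BaseModel(Model):
        class Meta:
            database = db

    class Post(BaseModel):
        # unique=True makes the database reject duplicate URLs
        url = CharField(unique=True)

    class Category(BaseModel):
        # one row per category name
        name = CharField(unique=True)

    class PostCategory(BaseModel):
        post = ForeignKeyField(Post)
        category = ForeignKeyField(Category)

        class Meta:
            # multi-column unique index: one row per (post, category) pair
            indexes = ((('post', 'category'), True),)

    db.create_tables([Post, Category, PostCategory])

    # get_or_create either fetches the existing category or inserts it once.
    category, created = Category.get_or_create(name='python')

    # Bulk inserts go through insert_many(); on_conflict_ignore() skips rows
    # that would violate a unique constraint instead of raising an error.
    post = Post.create(url='http://example.com/post-1')
    PostCategory.insert_many(
        [{'post': post, 'category': category}]
    ).on_conflict_ignore().execute()

Because the unique constraints live in the database, each scraping process can insert independently and duplicates are rejected consistently, regardless of which process writes first.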

coleifer