
Web scraping with Python and SQLite: how to store scraped data effectively?

I want to scrape some specific web pages on a regular basis (e.g. every hour) with Python. The scraped results should be inserted into an SQLite table. Each run will pick up new information but will also scrape 'old' information again, since the script runs every hour.

To be more precise, I want to scrape a sports results page where more and more match results are published on the same page as the tournament proceeds. With each new scrape, only the new results need to be inserted into the SQLite table, since the older ones were already scraped (and inserted) an hour or more earlier.

I also don't want to insert the same result twice when it gets scraped a second time, so there should be some mechanism to check whether a result has already been stored. Can this be done at the SQL level? That is, I scrape the whole page and issue an INSERT statement for each result, but only the INSERTs for results not yet in the database succeed. I'm thinking of something like a UNIQUE keyword.

Or am I worrying too much about performance, and should I simply DROP TABLE before each run and scrape everything from scratch again? This is not a lot of data: about 100 records (= matches) per tournament and about 50 tournaments a year.

Basically, I am just interested in a best-practice approach.

beta asked Apr 17 '13 14:04



2 Answers

What you want is an upsert (update the row if it exists, insert it if it doesn't). Check here to see how to do it in SQLite: SQLite UPSERT - ON DUPLICATE KEY UPDATE
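
A minimal sketch of an upsert with Python's built-in sqlite3 module. The `results` table, its columns, and the `upsert_result` helper are made up for illustration; they assume a match is identified by tournament plus the two teams. The sketch uses `INSERT OR IGNORE` followed by an `UPDATE`, which also works on older SQLite versions. SQLite 3.24.0 and later additionally accept `INSERT ... ON CONFLICT (...) DO UPDATE` as a single statement.

```python
import sqlite3


def upsert_result(conn, tournament, home, away, score):
    """Insert a match result, or update the score if the match is already stored."""
    with conn:  # commits on success, rolls back on error
        # Silently skipped when a row with the same UNIQUE key already exists.
        conn.execute(
            "INSERT OR IGNORE INTO results (tournament, home, away, score) "
            "VALUES (?, ?, ?, ?)",
            (tournament, home, away, score),
        )
        # Brings an existing row up to date (for a fresh insert it just
        # rewrites the same score).
        conn.execute(
            "UPDATE results SET score = ? "
            "WHERE tournament = ? AND home = ? AND away = ?",
            (score, tournament, home, away),
        )


conn = sqlite3.connect(":memory:")  # use a file path for a real scraper
conn.execute(
    "CREATE TABLE IF NOT EXISTS results ("
    " tournament TEXT NOT NULL,"
    " home TEXT NOT NULL,"
    " away TEXT NOT NULL,"
    " score TEXT,"
    " UNIQUE (tournament, home, away))"
)

upsert_result(conn, "Open 2013", "Team A", "Team B", "1:0")
# Same match scraped again an hour later, this time with a corrected score.
upsert_result(conn, "Open 2013", "Team A", "Team B", "2:0")
rows = conn.execute(
    "SELECT tournament, home, away, score FROM results"
).fetchall()
```

After both calls the table still holds a single row for the match, carrying the most recent score.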

Tewfik answered Oct 02 '22 23:10



It looks like you want to insert data if it doesn't exist? Perhaps something like:

  1. Check if the entry exists
  2. Insert the data if it doesn't
  3. Update the entry if it does (if you want updates at all)

You could issue two separate SQL statements: a SELECT, then an INSERT or UPDATE depending on the result.
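
The two-statement approach could be sketched like this. The `results` table, the `match_id` key and the `save_result` helper are hypothetical names chosen for the example. It assumes only one script writes to the database, which fits an hourly scraper, so nothing can slip in between the SELECT and the write.

```python
import sqlite3


def save_result(conn, match_id, score):
    """Store one scraped result; return 'inserted', 'updated' or 'unchanged'."""
    # Step 1: check whether the entry exists.
    row = conn.execute(
        "SELECT score FROM results WHERE match_id = ?", (match_id,)
    ).fetchone()
    if row is None:
        # Step 2: insert it if it doesn't.
        conn.execute(
            "INSERT INTO results (match_id, score) VALUES (?, ?)",
            (match_id, score),
        )
        action = "inserted"
    elif row[0] != score:
        # Step 3: update it if it does and the stored value differs.
        conn.execute(
            "UPDATE results SET score = ? WHERE match_id = ?",
            (score, match_id),
        )
        action = "updated"
    else:
        action = "unchanged"
    conn.commit()
    return action


conn = sqlite3.connect(":memory:")  # use a file path for a real scraper
conn.execute("CREATE TABLE results (match_id TEXT PRIMARY KEY, score TEXT)")

first = save_result(conn, "final-2013", "1:0")   # new match
second = save_result(conn, "final-2013", "1:0")  # scraped again, same score
third = save_result(conn, "final-2013", "2:0")   # scraped again, score changed
```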

Or you could put a UNIQUE constraint on the column(s), and I believe sqlite3 will then raise IntegrityError on a duplicate insert:

import sqlite3

try:
    # your INSERT here, e.g. cursor.execute("INSERT ...")
    pass
except sqlite3.IntegrityError:
    # the row violates the UNIQUE constraint, so it is a duplicate; skip it
    pass
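
Filled in, the try/except pattern above might look like the following. This is only a sketch: the table layout, the choice of (tournament, home, away) as the unique key, and the sample rows are assumptions for illustration, not taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for a real scraper
conn.execute(
    "CREATE TABLE results ("
    " tournament TEXT NOT NULL,"
    " home TEXT NOT NULL,"
    " away TEXT NOT NULL,"
    " score TEXT NOT NULL,"
    " UNIQUE (tournament, home, away))"
)

# Stand-in for what one scrape of the page returned; the last row repeats the first.
scraped = [
    ("Open 2013", "Team A", "Team B", "1:0"),
    ("Open 2013", "Team C", "Team D", "0:3"),
    ("Open 2013", "Team A", "Team B", "1:0"),
]

inserted = 0
skipped = 0
for row in scraped:
    try:
        with conn:  # one small transaction per row
            conn.execute(
                "INSERT INTO results (tournament, home, away, score) "
                "VALUES (?, ?, ?, ?)",
                row,
            )
        inserted += 1
    except sqlite3.IntegrityError:
        # UNIQUE constraint failed: this match is already in the table.
        skipped += 1
```

If duplicates should simply be dropped without counting them, `INSERT OR IGNORE` has the same effect without the try/except.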
dm03514 answered Oct 03 '22 00:10
