I want to scrape some specific webpages on a regular basis (e.g. every hour) with Python. The scraped results should be inserted into an SQLite table. New info will be scraped, but 'old' information will also be scraped again, since the Python script runs every hour.
To be more precise, I want to scrape a sports-results page, where more and more match results get published on the same page as the tournament proceeds. So with each new scrape I only need the new results to be entered into the SQLite table, since the older ones were already scraped (and inserted into the table) an hour before (or even earlier).
I also don't want to insert the same result twice when it gets scraped a second time, so there should be some mechanism to check whether a result has already been scraped. Can this be done at the SQL level? That is, I scrape the whole page and issue an INSERT statement for each result, but only those INSERT statements whose rows were not already present in the database execute successfully. I'm thinking of something like a UNIQUE constraint.
Or am I thinking too much about performance, and should I solve this by doing a DROP TABLE each time before I start scraping, then just scraping everything from scratch again? It's not really much data: just about 100 records (= matches) per tournament and about 50 tournaments a year.
Basically, I'm just interested in a best-practice approach.
What you want is an upsert (update, or insert if the row doesn't exist). See here for how to do it in SQLite: SQLite UPSERT - ON DUPLICATE KEY UPDATE
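For example, a minimal sketch in Python (the table and column names are made up for illustration, and the ON CONFLICT syntax needs SQLite 3.24 or newer):

import sqlite3

conn = sqlite3.connect("results.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS results (
        tournament TEXT,
        player1    TEXT,
        player2    TEXT,
        score      TEXT,
        UNIQUE (tournament, player1, player2)
    )
""")

# Upsert: insert the scraped result, or update the stored score
# if this match is already in the table.
conn.execute(
    """
    INSERT INTO results (tournament, player1, player2, score)
    VALUES (?, ?, ?, ?)
    ON CONFLICT (tournament, player1, player2)
    DO UPDATE SET score = excluded.score
    """,
    ("Open 2012", "Player A", "Player B", "6-4 6-3"),
)
conn.commit()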
It looks like you want to insert data only if it doesn't exist. One way is SQLite's INSERT OR IGNORE, which silently skips any row that would violate a UNIQUE constraint. A sketch, reusing the results table from above (names are illustrative):
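# Re-scraped results hit the UNIQUE constraint and are ignored,
# so only genuinely new rows end up in the table.
conn.execute(
    "INSERT OR IGNORE INTO results (tournament, player1, player2, score) "
    "VALUES (?, ?, ?, ?)",
    ("Open 2012", "Player A", "Player B", "6-4 6-3"),
)
conn.commit()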
You could issue two separate SQL statements, a SELECT followed by an INSERT/UPDATE.
Or you could set a UNIQUE constraint, and I believe sqlite3 will raise an IntegrityError:
import sqlite3

conn = sqlite3.connect("results.db")
row = ("Open 2012", "Player A", "Player B", "6-4 6-3")  # scraped values
try:
    # your insert here, e.g.:
    conn.execute("INSERT INTO results (tournament, player1, player2, score) VALUES (?, ?, ?, ?)", row)
    conn.commit()
except sqlite3.IntegrityError:
    # duplicate row: this result was already scraped and inserted earlier
    pass
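Note that this only works if the table declares a UNIQUE constraint (or primary key) on the columns that identify a match; without one, sqlite3 will happily insert the duplicate and no IntegrityError is ever raised. For an hourly batch of ~100 rows, wrapping all the inserts in a single transaction and committing once at the end also keeps the run fast.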