I have a cron job that runs every day, calls an API, and fetches some data. For each row of that data I enqueue a task to process it (which involves looking up data via further APIs). Once all of this has finished the data doesn't change for the next 24 hours, so I memcache it.
Is there a way of knowing when all the tasks I queued up have finished, so that I can cache the data?
Currently I do it in a really messy fashion, by just scheduling two cron jobs like this:
from datetime import date
import urllib

from django.utils import simplejson
from google.appengine.api import memcache, taskqueue
from google.appengine.ext import webapp

class fetchdata(webapp.RequestHandler):
    def get(self):
        todaykey = str(date.today())
        memcache.delete(todaykey)
        topsyurl = 'http://otter.topsy.com/search.json?q=site:open.spotify.com/album&window=d&perpage=20'
        f = urllib.urlopen(topsyurl)
        response = f.read()
        f.close()
        d = simplejson.loads(response)
        albums = d['response']['list']
        for album in albums:
            # one task per album row; /spotifyapi/ does the extra API lookups
            taskqueue.add(url='/spotifyapi/', params={'url': album['url'], 'score': album['score']})
class flushcache(webapp.RequestHandler):
    def get(self):
        todaykey = str(date.today())
        memcache.delete(todaykey)
Then my cron.yaml looks like this:
cron:
- description: gettopsy
  url: /fetchdata/
  schedule: every day 01:00
  timezone: Europe/London
- description: flushcache
  url: /flushcache/
  schedule: every day 01:05
  timezone: Europe/London
Basically, I'm guessing that all my tasks won't take more than five minutes to run, so I just flush the cache five minutes later; this ensures that by the time the data is cached it's complete.
Is there a better way of coding this? It feels like my solution isn't the best one...
Thanks Tom
There's not currently any way to determine when your tasks have finished executing. Your best option would be to insert marker records in the datastore, and have each task delete its record when it's done. Each task can then check if it's the last task, and perform your cleanup / caching if it is.
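A minimal sketch of that marker-record idea, using an in-memory set and dict as stand-ins for the datastore markers and memcache (the names `enqueue_tasks` and `task_done` are illustrative, not App Engine APIs; on App Engine the delete-and-check step would need a datastore transaction or sharded counter to avoid a race when two tasks finish at the same moment):

```python
from datetime import date

pending_markers = set()   # one marker per queued task (datastore stand-in)
results = []              # data collected by the tasks
cache = {}                # memcache stand-in

def enqueue_tasks(urls):
    """Create a marker for each task before queueing it."""
    for url in urls:
        pending_markers.add(url)

def task_done(url, result):
    """Called at the end of each task: record its result, delete its
    marker, and if it was the last task, cache the completed data set."""
    results.append(result)
    pending_markers.discard(url)
    if not pending_markers:   # this was the final task
        cache[str(date.today())] = list(results)
```

The key point is that the cache is only written by whichever task happens to finish last, so it always contains the complete data set.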
I found this question while dealing with the same issue. I came up with a different solution, which I'm posting here in case it's useful to others.
This isn't a direct replacement for what you are asking, but it's related: my problem was that I wanted to know when a queue was empty, because that meant a complex background process had finished running. So I could replace checking the queue size with checking a "deadman timer".
A deadman timer is a timer that is constantly reset by some process. When that process finishes, the timer is no longer reset and eventually expires. So I had all the different tasks that formed part of my complex background process reset the timer, and instead of checking when the queue was empty, I had a cron job that checked when the timer had expired.
Of course, for this to be efficient, the timer has to avoid writing to the datastore on every reset. The code at http://acooke.org/cute/Deadmantim0.html avoids this by relaxing the behaviour slightly: it keeps a copy of the timer object in memcache and only resets it in the datastore after a significant amount of time has passed.
PS: This is more efficient than what you describe because it doesn't need to write to the database as often. It's also more robust, because you don't have to keep exact track of what is happening.
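The core of the deadman-timer pattern can be sketched in a few lines of plain Python (this is my own simplification, not the code from the linked page: the last-reset time is just an attribute here, whereas the App Engine version keeps it in memcache with an occasional datastore write-through, and the 300-second timeout is an assumed value):

```python
import time

class DeadmanTimer(object):
    """Expires unless reset() is called within `timeout` seconds."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_reset = time.time()

    def reset(self):
        # Each task in the background process calls this as it runs.
        self.last_reset = time.time()

    def expired(self):
        # The cron job polls this; True means no task has reset the
        # timer recently, i.e. the background process has finished.
        return time.time() - self.last_reset > self.timeout
```

The cron job then replaces "are all tasks done?" with a single `timer.expired()` check, and performs the caching step when it returns True.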