
Strategy for caching a remote service: what should I be considering?

My web app contains data gathered from an external API over which I have no control. I'm limited to about 20,000 API requests per hour, and I have about 250,000 items in my database, each of which is essentially a cached copy of one remote item. It takes 1 request to update the cache of 1 item, so even at full capacity a complete refresh would take 12.5 hours; obviously, a perfectly up-to-date cache is not possible under these circumstances. So, what should I be considering when developing a strategy for caching the data? These are the things that come to mind, but I'm hoping someone has some good ideas I haven't thought of:

  • time since item was created (less time means more important)
  • number of 'likes' a particular item has (could mean higher probability of being viewed)
  • time since last updated

A few more details: the items are photos, and every photo belongs to an event. Events that are currently occurring are more likely to be viewed by clients (therefore their photos should take priority). Though I only have 250K items in the database now, that number is increasing rather rapidly (it will not be long until the 1 million mark is reached, maybe 5 months).

asked Jun 12 '13 by celwell


2 Answers

Would http://instagram.com/developer/realtime/ be any use? It appears that Instagram is willing to POST to your server when there are new (and maybe updated?) images for you to check out. Would that do the trick?
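
For illustration, here's a minimal sketch of what receiving those notifications could look like. It assumes a Flask app and a hypothetical queue_for_recache() helper; the handshake shown (echoing hub.challenge on a GET) follows the PubSubHubbub pattern that Instagram's realtime API is based on, but check the linked docs for the exact payload format.

    # Minimal sketch of a realtime-subscription callback endpoint.
    # queue_for_recache() is a hypothetical helper; the payload fields
    # are simplified, so verify them against Instagram's documentation.
    from flask import Flask, request

    app = Flask(__name__)

    def queue_for_recache(object_id):
        # Hypothetical: persist object_id to a work queue your updater drains.
        print(f"re-cache requested for {object_id}")

    @app.route("/instagram/callback", methods=["GET", "POST"])
    def instagram_callback():
        if request.method == "GET":
            # Subscription handshake: echo the challenge to verify this endpoint.
            return request.args.get("hub.challenge", "")
        # POST: you're told *which* subscriptions have new data; you still spend
        # API requests to fetch the changes, but only where something happened.
        for update in request.get_json(force=True):
            queue_for_recache(update.get("object_id"))
        return "", 200

The win is that your request budget gets spent only on items that actually changed, instead of on polling everything.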

Otherwise, I think your problem sounds much like the problem any search engine has—have you seen Wikipedia on crawler selection criteria? You're dealing with many of the problems faced by web crawlers: what to crawl, how often to crawl it, and how to avoid making too many requests to an individual site. You might also look at open-source crawlers (on the same page) for code and algorithms you might be able to study.

Anyway, to throw out some thoughts on standards for crawling:

  • Update most often the things that change most often. So, if an item hasn't changed in the last five updates, then maybe you can assume it won't change as often and update it less frequently.
  • Create a score for each image, and update the ones with the highest scores first (or the lowest, depending on what kind of score you're using). This is a similar idea to the one LilyPond uses to typeset music; a rough sketch combining scoring with the request-budget split below follows this list. Some ways to create input for such a score:
    • A statistical model of the chance of an image being updated and needing to be recached.
    • An importance score for each image, using things like the recency of the image, or the currency of its event.
  • Update things that are being viewed frequently.
  • Update things that have many views.
  • Does time affect the probability that an image will be updated? You mentioned that newer images are more important, but what about the probability of changes to older ones? If older images rarely change, slow down the frequency of checks on them.
  • Allocate part of your requests to slowly updating everything, and split up the rest so several different algorithms run simultaneously. For example (the numbers are for show only; I just pulled them out of a hat):
    • 5,000 requests per hour churning through the complete contents of the database (provided they've not been updated since the last time that crawler came through)
    • 2,500 requests processing new images (which you mentioned are more important)
    • 2,500 requests processing images of current events
    • 2,500 requests processing images that are in the top 15,000 most viewed (as long as there has been a change in the last 5 checks of that image; otherwise, check it on a decreasing schedule)
    • 2,500 requests processing images that have been viewed at least
    • Total: 15,000 requests per hour, leaving 5,000 of the 20,000 hourly limit in reserve.
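
To make the scoring and budget-splitting ideas concrete, here's a rough sketch in Python. Everything in it is an assumption: the Item fields, the weights in score(), and the pool sizes are placeholders that show the shape of the approach, not tuned values.

    # Sketch of a per-hour request budget split across prioritized pools.
    # All field names, weights, and pool sizes are illustrative placeholders.
    import time
    from dataclasses import dataclass

    @dataclass
    class Item:
        id: int
        created_at: float            # unix timestamp
        views: int = 0
        event_is_current: bool = False
        unchanged_checks: int = 0    # consecutive checks that found no change
        last_checked: float = 0.0

    def score(item, now):
        """Higher score = more worth re-fetching this hour."""
        age_hours = (now - item.created_at) / 3600
        s = 10.0 / (1.0 + age_hours)           # newer images matter more
        s += 5.0 if item.event_is_current else 0.0
        s += min(item.views, 1000) / 200.0     # capped view-count signal
        return s / (2 ** item.unchanged_checks)  # back off items that never change

    def plan_hour(items, budget=15000):
        """Choose which items to refresh this hour, within the request budget."""
        now = time.time()
        # Reserve a third of the budget for the slow full sweep: whatever
        # has gone longest without a check goes first.
        sweep = sorted(items, key=lambda i: i.last_checked)[: budget // 3]
        ranked = sorted(items, key=lambda i: score(i, now), reverse=True)
        chosen, seen = [], set()
        for item in sweep + ranked:
            if item.id not in seen:
                seen.add(item.id)
                chosen.append(item)
            if len(chosen) >= budget:
                break
        return chosen

A real version would persist unchanged_checks and last_checked between runs, and re-tune the weights against how often items actually turn out to have changed.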
answered Oct 18 '22 by andyg0808


How many (unique) photos / events are viewed on your site per hour? Photos that are not viewed probably don't need to be updated often. Do you see any patterns in views for old events / photos? Old events might not be as popular, so perhaps they don't have to be checked that often.

andyg0808's answer has good, detailed information; however, it is important to know the patterns of your data usage before applying it in practice.

At some point you will find that 20,000 API requests per hour will not be enough to update frequently viewed photos, which might lead you to different questions as well.
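
If you don't already track views, here's a tiny sketch of the kind of measurement being suggested, using Python's standard library (the table and column names are made up):

    # Sketch: record photo views, then ask which photos were hot in the last hour.
    # Schema and names are made up; adapt them to your own database.
    import sqlite3
    import time

    db = sqlite3.connect("views.db")
    db.execute("CREATE TABLE IF NOT EXISTS views (photo_id INTEGER, ts REAL)")

    def record_view(photo_id):
        db.execute("INSERT INTO views VALUES (?, ?)", (photo_id, time.time()))
        db.commit()

    def hot_photos(limit=100):
        """Photo ids most viewed in the last hour; refresh these first."""
        cutoff = time.time() - 3600
        rows = db.execute(
            "SELECT photo_id, COUNT(*) FROM views"
            " WHERE ts > ? GROUP BY photo_id ORDER BY COUNT(*) DESC LIMIT ?",
            (cutoff, limit),
        )
        return [row[0] for row in rows]

Even a week of data like this will tell you whether old events keep getting views or drop off, which decides how far down the long tail your 20,000 requests need to reach.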

answered Oct 18 '22 by Alex S