Strategy for caching of remote service; what should I be considering?

Tags: database, optimization, caching, api

My web app contains data gathered from an external API over which I have no control. I'm limited to about 20,000 API requests per hour. I have about 250,000 items in my database, and each of these items is essentially a cached copy of the remote data. It takes 1 request to update the cache of 1 item, so it is obviously not possible to have a perfectly up-to-date cache under these circumstances. What should I be considering when developing a strategy for caching the data? These are the things that come to mind, but I'm hoping someone has some good ideas I haven't thought of.

- time since item was created (less time means more important)
- number of 'likes' a particular item has (could mean higher probability of being viewed)
- time since last updated

A few more details: the items are photos. Every photo belongs to an event. Events that are currently occurring are more likely to be viewed by clients, so they should take priority. Though I only have 250K items in the database now, that number is increasing rather rapidly (it will not be long until the 1 million mark is reached, maybe 5 months).


asked Jun 12 '13 23:06

celwell

2 Answers

Would http://instagram.com/developer/realtime/ be any use? It appears that Instagram is willing to POST to your server when there's new (and maybe updated?) images for you to check out. Would that do the trick?

Otherwise, I think your problem sounds much like the problem any search engine has—have you seen Wikipedia on crawler selection criteria? You're dealing with many of the problems faced by web crawlers: what to crawl, how often to crawl it, and how to avoid making too many requests to an individual site. You might also look at open-source crawlers (on the same page) for code and algorithms you might be able to study.

Anyway, to throw out some thoughts on standards for crawling:

- Update the things that have changed often in past checks. If an item hasn't changed in the last five updates, you could assume it won't change as often and update it less frequently.
- Create a score for each image, and update the ones with the highest scores (or the lowest scores, depending on what kind of score you're using). This is a similar thought to what is used by LilyPond to typeset music. Some ways to create input for such a score:
  - A statistical model of the chance of an image being updated and needing to be recached.
  - An importance score for each image, using things like the recency of the image, or the currency of its event.
- Update things that are being viewed frequently.
- Update things that have many views.
- Does time affect the probability that an image will be updated? You mentioned that newer images are more important, but what about the probability of changes on older ones? If older images rarely change, slow down the frequency of checks on them.
- Allocate part of your requests to slowly updating everything, and split up the other parts to process results from several different algorithms simultaneously. For example, have the following (numbers are for show/example only; I just pulled them out of a hat):
  - 5,000 requests per hour churning through the complete contents of the database (provided they've not been updated since the last time that crawler came through)
  - 2,500 requests processing new images (which you mentioned are more important)
  - 2,500 requests processing images of current events
  - 2,500 requests processing images that are in the top 15,000 most viewed (as long as there has been a change in the last 5 checks of that image; otherwise, check it on a decreasing schedule)
  - 2,500 requests processing images that have been viewed at least
  - Total: 15,000 requests per hour.
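
To make the scoring idea a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `Photo` fields, the weights, and the backoff rule are placeholders I made up, not something taken from your app or from any library. It combines staleness, an importance factor (likes, recent views, live event, new photo), and a backoff for items that keep coming back unchanged, then picks the highest-scoring items that fit in a request budget.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Photo:
    photo_id: int
    age_hours: float            # time since the photo was created
    hours_since_refresh: float  # time since the cached copy was last updated
    likes: int
    views_last_day: int
    event_is_live: bool
    unchanged_checks: int       # consecutive refreshes that found no change


def refresh_score(p: Photo) -> float:
    """Higher score means refresh sooner. All weights are arbitrary placeholders."""
    importance = 1.0 + 0.01 * p.likes + 0.05 * p.views_last_day
    if p.event_is_live:
        importance *= 3.0   # photos of current events take priority
    if p.age_hours < 24:
        importance *= 2.0   # new photos are more important
    # Halve the score for each consecutive unchanged check, capped at 5 halvings.
    backoff = 2 ** min(p.unchanged_checks, 5)
    return p.hours_since_refresh * importance / backoff


def pick_batch(photos, budget):
    """Return the `budget` photos with the highest refresh score."""
    return heapq.nlargest(budget, photos, key=refresh_score)


photos = [
    Photo(1, 2.0, 4.0, 100, 20, True, 0),    # new, popular, live event
    Photo(2, 500.0, 10.0, 0, 0, False, 5),   # old, never changes
    Photo(3, 500.0, 10.0, 0, 0, False, 0),   # old, but changed at the last check
]
print([p.photo_id for p in pick_batch(photos, budget=2)])  # [1, 3]
```

With 250K+ rows you would probably store the inputs in the database and compute the ordering there, rather than loading every photo into memory. The same scoring function could also be run separately inside each of the request pools listed above.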


answered Oct 18 '22 22:10

andyg0808

How many (unique) photos / events are viewed on your site per hour? Photos that are not viewed probably don't need to be updated often. Do you see any patterns in views for old events / photos? Old events might not be as popular, so perhaps they don't have to be checked that often.
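
If you don't have that number yet, it could be measured with something as simple as the sketch below. It assumes a hypothetical view log of `(timestamp, photo_id)` pairs; your real logging format will differ.

```python
from collections import defaultdict
from datetime import datetime


def unique_views_per_hour(view_log):
    """Count distinct photos viewed in each clock hour.

    `view_log` is an iterable of (timestamp, photo_id) pairs; this log
    format is an assumption made for the example.
    """
    seen = defaultdict(set)
    for ts, photo_id in view_log:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        seen[hour].add(photo_id)
    return {hour: len(ids) for hour, ids in seen.items()}


log = [
    (datetime(2013, 6, 12, 23, 5), 1),
    (datetime(2013, 6, 12, 23, 40), 1),   # repeat view of the same photo
    (datetime(2013, 6, 12, 23, 59), 2),
    (datetime(2013, 6, 13, 0, 1), 1),
]
for hour, count in sorted(unique_views_per_hour(log).items()):
    print(hour, count)
```

Comparing the hourly counts against the 20,000 request limit tells you whether refreshing only the viewed photos is even feasible.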

andyg0808 has good, detailed information; however, it is important to know the patterns of your data usage before applying it in practice.

At some point you will find that 20,000 API requests per hour will not be enough to update frequently viewed photos, which might lead you to different questions as well.
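
A quick back-of-the-envelope calculation shows where that limit bites. This sketch only uses the numbers from the question (20,000 requests per hour, 1 request per item); the function names are made up for the example.

```python
def full_sweep_hours(item_count, requests_per_hour=20_000):
    """Hours needed to refresh every item once, at 1 request per item."""
    return item_count / requests_per_hour


def max_items_within_staleness(max_staleness_hours, requests_per_hour=20_000):
    """Largest set of items that can all be kept fresher than the given age."""
    return int(requests_per_hour * max_staleness_hours)


print(full_sweep_hours(250_000))         # 12.5
print(full_sweep_hours(1_000_000))       # 50.0
print(max_items_within_staleness(0.25))  # 5000
```

So even if the whole budget went to a plain round-robin sweep, each photo would be refreshed about twice a day now and about once every two days at 1 million photos, and no more than 5,000 photos could be kept under 15 minutes stale.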

answered Oct 18 '22 21:10

Alex S
