
Caching vs Indexing


What's the real difference between a caching solution and an indexing solution? It seems to me that an indexing solution is in fact a cache with the ability to run search queries (like Elasticsearch). Would there ever be a real reason to use both a caching solution and an indexing solution within the same project, or does the indexing solution basically make any other caching redundant?

Example: Say I use NEST for Elasticsearch, which stores and returns POCOs; if I then query Elasticsearch and have a POCO returned to me, isn't that considered using a cached object returned from Elasticsearch?

At the moment, I store data in a cache using an ICacheManager interface I have, something like this:

return CacheManager.Get(cacheKey, () => {
    // return something...
});

Would this become redundant with Elasticsearch?
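For context, a cache-aside implementation behind an interface like the one above might look like this. This is a minimal sketch: ICacheManager and Get come from the snippet above, while MemoryCacheManager and its dictionary-backed store are illustrative assumptions, not a real library.

```csharp
using System;
using System.Collections.Concurrent;

// Cache-aside sketch: return the cached value when present; otherwise
// run the factory, store the result, and return it.
public interface ICacheManager
{
    T Get<T>(string cacheKey, Func<T> factory);
}

public class MemoryCacheManager : ICacheManager
{
    private readonly ConcurrentDictionary<string, object> _store = new();

    public T Get<T>(string cacheKey, Func<T> factory)
    {
        // GetOrAdd runs the factory only on a cache miss.
        return (T)_store.GetOrAdd(cacheKey, _ => factory());
    }
}
```

A second call with the same cacheKey skips the factory entirely, which is exactly what an index does not promise: Elasticsearch still executes a query on every request, however fast.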

EDIT

Thanks to all of you for the answers. I am fully aware of what a cache is and already understood the general idea behind an index for textual search, so I was only really wondering whether the index doubles as a cache and would therefore make any other cache redundant. After all, I wouldn't want to keep two caches in memory (for example, Elasticsearch + Redis) when one would do fine. I think I have a better idea now, especially after realizing that not all fields are always stored in the index, so we need to get the object from a cache or directly from the DB anyway, at least in some cases. Thanks all!

asked Dec 20 '15 by Matt



2 Answers

The whole purpose of a cache is to return already requested data as fast as possible. One constraint is that a cache cannot grow too big either, as lookup time would increase and defeat the purpose of having a cache in the first place. So it comes as no surprise that if you plan to have a few million or billion records in your DB, it won't be difficult to index them all, but it will be difficult to cache them all; though since RAM keeps getting cheaper, you might be able to fit everything you need in memory. You also need to ask yourself whether your cache needs to be distributed across several hosts, now or in the future.

Considering that lookups and queries in ES are extremely fast (and ES brings many more benefits on top of that, such as scoring), i.e. usually faster than retrieving the same data from your DB, it can make sense to use ES as a cache. One issue I see is a common one: as soon as you start duplicating data (DB -> ES), you need to ensure that both stores don't get out of sync.

Now, if you throw a cache into that mix as well, it's a third data store to maintain and keep consistent with the main one. If you know your data is pretty stable, i.e. written once and not updated frequently, that might be OK, but you need to keep this concern in mind all the time when designing your data access strategy.
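That consistency concern can be sketched as a dual-write: the database is written first as the source of truth, then the index is updated in the same code path. In this hedged sketch all names are illustrative; the two delegates stand in for a real DB client and an Elasticsearch client, not NEST's API.

```csharp
using System;

// Dual-write sketch: the database stays the source of truth and the
// search index is updated in the same code path.
public record Product(int Id, string Name);

public class ProductRepository
{
    private readonly Action<Product> _saveToDb;
    private readonly Action<Product> _indexInSearch;

    public ProductRepository(Action<Product> saveToDb, Action<Product> indexInSearch)
    {
        _saveToDb = saveToDb;
        _indexInSearch = indexInSearch;
    }

    public void Save(Product p)
    {
        _saveToDb(p);       // 1. write the source of truth first
        _indexInSearch(p);  // 2. then update the index; production code often
                            //    queues and retries this step so the stores converge
    }
}
```

If the index update can fail independently of the DB write, pushing step 2 onto a durable queue keeps the two stores eventually consistent without blocking the write path.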

As @paweloque said, in the end it all depends on your exact use case(s). Every problem is different and I can attest that after a few dozen projects around ES over the past five years or so, I've never seen two projects configured the same way. A cache might make sense for some specific cases, but not at all for others.

You need to think hard about how and where you need to store your data, who is requesting it (and at what rate), and who is creating or updating it (and at what rate). In the end, the best practice is to keep your stack as lean as possible, with only as few components as needed, since each one is a potential bottleneck that you have to understand, integrate, maintain, tune and monitor.

Finally, I'd add one more thing: adding a cache or an index should be considered a performance optimization of your software stack. As the common saying goes, "premature optimization is the root of all evil": you should first go with your database only, measure the performance, load test it, and see whether it supports the load. Only then can you decide to throw a cache and/or an index at it, depending on the needs. Again, load test, measure, then decide. If you only have ten users making a few requests per day, a DB alone might be perfectly fine. You have to understand when and why you need to add another layer to your Tower of Babel, but most importantly you need to add one layer at a time and see how that layer improves or degrades the stability of the stack.

Last but not least, you can find some online articles from people having used ES as caches (mainly key-value stores, and object caches).

answered Sep 27 '22 by Val


Your question:

Q. What's the real difference between a caching solution and an indexing solution?

A. The simple difference is that a cache stores frequently used data to serve repeated requests faster. In essence, your cache is faster than your main store but smaller, and therefore limited in how much data it can hold (fast storage is generally more expensive).

Indexing is applied to all of the data to make it searchable faster. A Hashtable/HashMap uses hashes as indexes, and in an array the numeric positions (0, 1, 2, ...) are the indexes.
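The idea can be sketched with a toy inverted index, the structure a search engine like Elasticsearch builds under the hood: each term maps to the set of document ids containing it, so a search touches one dictionary entry instead of scanning every document. The names below are illustrative.

```csharp
using System;
using System.Collections.Generic;

// Toy inverted index: term -> set of document ids containing the term.
public class InvertedIndex
{
    private readonly Dictionary<string, HashSet<int>> _postings = new();

    public void Add(int docId, string text)
    {
        foreach (var term in text.ToLowerInvariant()
                                 .Split(' ', StringSplitOptions.RemoveEmptyEntries))
        {
            if (!_postings.TryGetValue(term, out var docs))
                _postings[term] = docs = new HashSet<int>();
            docs.Add(docId);
        }
    }

    public IReadOnlyCollection<int> Search(string term)
        => _postings.TryGetValue(term.ToLowerInvariant(), out var docs)
            ? (IReadOnlyCollection<int>)docs
            : Array.Empty<int>();
}
```

Note that the index stores only terms and document ids, not necessarily the full documents; that is why a separate cache or a trip back to the DB can still be needed to materialize the objects.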

You can index some columns to search them faster, but a cache is where you put data to fetch it faster. Normally the cache lives in RAM while the database reads from disk.

A cache is also usually a key-value store: if you know the key, you fetch the value from the cache with no need to run a query. In NHibernate and Entity Framework, query caches plug in with queries as keys and the result data as values, so repeated queries are served from the cache instead of running against the database.
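That query-cache mechanism can be sketched as follows: the cache key is derived from the query text plus its parameters, and the whole result set is the cached value. This is a minimal sketch; the QueryCache name and shapes are illustrative, not NHibernate's or Entity Framework's internals.

```csharp
using System;
using System.Collections.Generic;

// Query-cache sketch: key = query text + parameters, value = result set.
public class QueryCache
{
    private readonly Dictionary<string, IReadOnlyList<string>> _results = new();
    public int Misses { get; private set; }  // exposed only to observe behavior

    public IReadOnlyList<string> GetOrRun(
        string sql, string param, Func<IReadOnlyList<string>> runQuery)
    {
        var key = sql + "|" + param;           // query + params identify the entry
        if (!_results.TryGetValue(key, out var rows))
        {
            Misses++;                          // only a miss hits the database
            _results[key] = rows = runQuery();
        }
        return rows;
    }
}
```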

answered Sep 27 '22 by Basit Anwer