
How should I store Dynamically Changing Data into Server Cache?

EDIT: Purpose of this website: It's called Utopiapimp.com, a third-party utility for a game called utopia-game.com. The site currently has over 12k users and I run it. The game is fully text based and will always remain so. Users copy full pages of text from the game and paste them into my site. I run a series of regular expressions against the pasted data to break it down, then insert anywhere from 5 to over 30 values into the DB based on that one paste. I then run queries against those values to display the information back in a very simple, easy-to-understand way. The game is team based and each team has 25 users, so each team is a group and each row is one user's information. Users can update all 25 rows or just one row at a time. I need to store things in cache because the site is very slow, running over 1,000 queries almost every minute.

So here is the deal. Imagine I have a spreadsheet (EDIT: Excel is just an example of how to picture it; I don't actually use Excel) with 100 columns and 5,000 rows. Each row has two unique identifiers: one for the row itself and one that groups rows together, 25 rows apiece. About 10 columns in each row will almost never change, while the other 90 will always be changing; some may even change within seconds, depending on how quickly the row is updated. Rows can also be added to and removed from a group, but not from the database. The rows come from about 4 database queries, so they show the most recent data. Every time something in the database is updated, I would like the cached row to be updated as well. If a row or group has not been updated in 12 or so hours, it should be taken out of cache. The next time a user requests the group, the DB queries run again and the results go back into cache.

The above is what I would like. That is the wish.

In reality, I still have all the rows, but the way I store them in cache is currently broken. I store each row in a class, and the classes go into the server cache as one HUGE list. When I update, delete, or insert items in that list, it works most of the time, but sometimes it throws errors because the cache has changed underneath me. I want to be able to lock the cache, more or less the way a database takes a lock on a row. I have DateTime stamps to remove entries after 12 hours, but this almost always breaks because other users are updating the same 25 rows in the group, or the cache has simply changed.

This is an example of how I add items to cache; it pulls only the 10 or so columns that very rarely change. The example also removes rows not updated in the last 12 hours:

DateTime dt = DateTime.UtcNow;
    if (HttpContext.Current.Cache["GetRows"] != null)
    {
        List<RowIdentifiers> pis = (List<RowIdentifiers>)HttpContext.Current.Cache["GetRows"];
        var ch = (from xx in pis
                  where xx.groupID == groupID
                  where xx.rowID == rowID
                  select xx).ToList();
        if (ch.Count == 0)
        {
            var ck = GetInGroupNotCached(rowID, groupID, dt); // Pull the group from the DB
            for (int i = 0; i < ck.Count(); i++)
                pis.Add(ck[i]);
            // Evict anything not updated in the last 12 hours
            pis.RemoveAll(x => x.updateDateTime < dt.AddHours(-12));
            HttpContext.Current.Cache["GetRows"] = pis;
            return ck;
        }
        else
            return ch;
    }
    else
    {
        var pis = GetInGroupNotCached(rowID, groupID, dt); // Pull the group from the DB
        HttpContext.Current.Cache["GetRows"] = pis;
        return pis;
    }

On the last point, I remove items from the cache, so the cache doesn't actually get huge.

To restate the question: what's a better way of doing this, and how would I put locks on the cache? Can I do better than this? I just want it to stop breaking when removing or adding rows.
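One hedged sketch of the locking idea (my illustration, not from the answers below; it assumes the same `RowIdentifiers` class and `GetInGroupNotCached` method as in the code above): serialize every reader and writer of the shared list behind a single static lock object, so the list can never change while another request is iterating it.

```csharp
// Sketch only: all access to the shared cached list goes through one lock.
public static class RowCache
{
    private static readonly object SyncRoot = new object();

    public static List<RowIdentifiers> GetRowsLocked(int rowID, int groupID)
    {
        DateTime dt = DateTime.UtcNow;
        lock (SyncRoot) // no other request can mutate the list while we hold this
        {
            var pis = (List<RowIdentifiers>)HttpContext.Current.Cache["GetRows"];
            if (pis == null)
            {
                pis = GetInGroupNotCached(rowID, groupID, dt);
                HttpContext.Current.Cache["GetRows"] = pis;
                return pis;
            }

            var ch = pis.Where(x => x.groupID == groupID && x.rowID == rowID).ToList();
            if (ch.Count > 0)
                return ch;

            var ck = GetInGroupNotCached(rowID, groupID, dt);
            pis.AddRange(ck);
            // Evict anything not updated in the last 12 hours, under the lock
            pis.RemoveAll(x => x.updateDateTime < dt.AddHours(-12));
            HttpContext.Current.Cache["GetRows"] = pis;
            return ck;
        }
    }
}
```

The trade-off is that every request now queues behind the lock, so this only helps if the critical section stays short.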

EDIT: SQLCacheDependency does NOT work with LINQ as posted in Remus's comments. It works for a full-table select, but I want to select only certain columns from the rows, not entire rows, so I cannot use Remus's idea.

Neither of the following code samples works.

var ck = (from xx in db.GetInGroupNotCached
          where xx.rowID == rowID
          select new
          {
              xx.Item,
              xx.AnotherItem,
              xx.AnotherItemm,
          }).CacheSql(db, "Item:" + rowID.ToString()).ToList();


var ck = (from xx in db.GetInGroupNotCached
          where xx.rowID == rowID
          select new ClassExample
          {
              Item = xx.Item,
              AnotherItem = xx.AnotherItem,
              AnotherItemm = xx.AnotherItemm,
          }).CacheSql(db, "Item:" + rowID.ToString()).ToList();
SpoiledTechie.com asked Apr 07 '10


2 Answers

I really doubt your caching solution is of any use. A List<T> has no index over its contents, so a lookup in your list is always an O(n) scan.
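To illustrate the point (my example, not the answerer's; the `Row` class and key format are made up): keying the rows by group and row ID in a Dictionary turns the O(n) scan into an O(1) hash lookup.

```csharp
using System;
using System.Collections.Generic;

class Row
{
    public int GroupID;
    public int RowID;
    public string Data;
}

class Program
{
    static void Main()
    {
        var rows = new List<Row>
        {
            new Row { GroupID = 1, RowID = 10, Data = "a" },
            new Row { GroupID = 1, RowID = 11, Data = "b" },
        };

        // O(n): scans every element until it finds a match.
        Row slow = rows.Find(r => r.GroupID == 1 && r.RowID == 11);

        // O(1): hash lookup on a composite string key like "group:row".
        var index = new Dictionary<string, Row>();
        foreach (Row r in rows)
            index[r.GroupID + ":" + r.RowID] = r;

        Row fast = index["1:11"];
        Console.WriteLine(slow.Data == fast.Data); // both find the same row
    }
}
```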

Assuming you have profiled your application and know the database is your bottleneck, this is what you can do:

In a database you can create indexes on your data; a lookup through an index is typically O(log n). You should create covering indexes for queries over your static data. Leave the frequently changing data non-indexed, because indexing it would slow down inserts and updates due to the necessary index maintenance. You can read up on SQL Server indexing here. Get your hands on the SQL Server Profiler and check which queries are slowest and why. Proper indexes can get you huge performance gains (e.g. an index on your GroupId will cut the lookup from a full table scan, O(n), to roughly O(n/25), assuming 25 people per group).
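A sketch of what such a covering index might look like (table and column names are hypothetical; the question never shows the real schema):

```sql
-- Hypothetical schema: dbo.Rows(RowID, GroupID, ...static and dynamic columns).
-- Index on GroupID, covering a few rarely-changing columns so the query
-- can be answered from the index alone, without touching the base table.
CREATE NONCLUSTERED INDEX IX_Rows_GroupId
    ON dbo.Rows (GroupID)
    INCLUDE (RowID, StaticCol1, StaticCol2);
```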

More often than not, people write suboptimal SQL (returning unnecessary columns, SELECT N+1 patterns, Cartesian joins). You should check for that too.

Before implementing a cache, I would make sure your database really is the culprit for your performance problems. Premature optimization is the root of all evil, and caching is hard to get right. Frequently changing data is not what caching is intended for.

Johannes Rudolph answered Oct 25 '22


In general, the reason for caching is that you can pull data out of memory (without it being stale) faster than you can pull it from the database. Pulling the right data from cache is a cache hit. If your schema has a low cache-hit rate, the cache is probably hurting more than helping. If your data changes rapidly, you will have a low cache-hit rate, and caching will be slower than simply querying for the data.

The trick is to split your data into infrequently changing and frequently changing elements. Cache the infrequently changing elements and do not cache the frequently changing ones. This can even be done at the database level on a single entity by using a 1:1 relationship, where one table contains the infrequently changing data and the other the frequently changing data. You said your source data has 10 columns that almost never change and 90 that change frequently. Build your objects around that notion, so that you can cache the 10 that rarely change and query for the 90 that change often.
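A sketch of that split (the class and method names here are mine, not from the answer): the rarely-changing part of each row is cached with a 12-hour expiry, while the volatile part is always fetched fresh from the database.

```csharp
// Hypothetical types: StaticRowInfo holds the ~10 rarely-changing columns,
// VolatileRowInfo the ~90 frequently-changing ones. LoadStaticFromDb,
// LoadVolatileFromDb, and RowView are placeholders for your own data access.
public class StaticRowInfo   { public int RowID; public string Name;  /* ... */ }
public class VolatileRowInfo { public int RowID; public int Score;    /* ... */ }

public RowView GetRow(int rowID)
{
    string key = "Static:" + rowID;
    var statics = (StaticRowInfo)HttpContext.Current.Cache[key];
    if (statics == null)
    {
        statics = LoadStaticFromDb(rowID);            // hits the DB rarely
        HttpContext.Current.Cache.Insert(
            key, statics, null,
            DateTime.UtcNow.AddHours(12),             // absolute 12-hour expiry
            System.Web.Caching.Cache.NoSlidingExpiration);
    }

    VolatileRowInfo volatiles = LoadVolatileFromDb(rowID); // always fresh
    return new RowView(statics, volatiles);
}
```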

I store each row in a class and the class is stored in the Server Cache via a HUGE list

From your original post, it sounds like you are not storing each instance in cache but rather a single list of instances as one cache entry. The problem is that this design invites multi-threading issues: when multiple threads pull the one-list-to-rule-them-all, they are all accessing the same instance in memory (assuming they are on the same server). Furthermore, as you have discovered, a CacheDependency will not work in this design because it expires the entire list rather than a single item.

One obvious, but highly problematic, solution would be to change your design to store each instance in cache under a logical key of some sort and add a CacheDependency for each instance. The problem is that if the number of instances is large, verifying the currency of each instance and expiring it when necessary creates a lot of overhead, and if the cache items poll the database, a lot of traffic as well.
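For illustration (my sketch; the "Row:{group}:{row}" key format is made up): each row goes into cache under its own composite key, so one row can be expired without touching the other 24 in its group.

```csharp
// Sketch: one cache entry per row instead of one huge list.
// Assumes the same RowIdentifiers class as in the question.
public void CacheRow(RowIdentifiers row)
{
    string key = "Row:" + row.groupID + ":" + row.rowID;
    HttpContext.Current.Cache.Insert(
        key, row, null,
        DateTime.UtcNow.AddHours(12),                  // 12-hour absolute expiry
        System.Web.Caching.Cache.NoSlidingExpiration);
}

public RowIdentifiers GetRow(int groupID, int rowID)
{
    string key = "Row:" + groupID + ":" + rowID;
    return (RowIdentifiers)HttpContext.Current.Cache[key]; // null on a miss
}
```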

An approach I have used to solve the problem of having a large number of database-dependent CacheDependencies is to write a custom ICacheItemExpiration for the Caching Block in the Enterprise Library. This also meant using the Caching Block, not the ASP.NET cache directly, to cache my objects. In this variant I created a class called DatabaseExpirationManager, which kept track of which items to expire from cache. I would still add each item to the cache individually, but with this modified expiration, which simply registered the item with the DatabaseExpirationManager. The DatabaseExpirationManager would be notified of the keys that needed to expire and would remove those items from cache. I will say, right from the start, that this solution will probably not work on rapidly changing data: the DatabaseExpirationManager would constantly hold a lock on its list of items to expire, preventing new items from being added. You would have to do some serious multi-threading analysis to reduce contention without enabling a race condition.

ADDITION

Ok. First, fair warning that this will be a long post. Second, this is not even the entire library, as that would be too long.

Taking the wayback machine: I wrote this code in late 2005/early 2006, right as .NET 2.0 came out, and I haven't investigated whether more recent libraries do this better (almost assuredly they do). I was using the January 2005/May 2005/January 2006 Enterprise Library releases. You can still get the 2006 library off CodePlex.

The way I came up with this solution was to look at the source of the caching system in the Enterprise Library. In short, everything feeds through the CacheManager class. That class has three primary components (all in the Microsoft.Practices.EnterpriseLibrary.Caching namespace): Cache, BackgroundScheduler, and ExpirationPollTimer.

The Cache class is the EntLib's implementation of cache. The BackgroundScheduler was used to scavenge the cache on a separate thread. The ExpirationPollTimer was a wrapper around a Timer class.

So, first off, it should be noted that the Cache scavenges itself based on a timer. Similarly, my solution would poll the database on a timer. The EntLib cache and the ASP.NET cache both work on the individual items having a delegate to check when the item should be expired. My solution worked on the premise of an outside entity checking when the items should be expired. The second thing to note is that whenever you start playing around with a central cache, you have to be attentive to multi-threading issues.

First I replaced the BackgroundScheduler with two classes: DatabaseExpirationWorker and DatabaseExpirationManager. DatabaseExpirationManager contained the important method that queried the database for changes and passed the list of changes to an event:

private object _syncRoot = new object();
private List<Guid> _objectChanges = new List<Guid>();
public event EventHandler<DatabaseExpirationEventArgs> ExpirationFired;
...
public void UpdateExpirations()
{
    lock ( _syncRoot )
    {
        DataTable dt = GetExpirationsFromDb();
        List<Guid> keys = new List<Guid>();
        foreach ( DataRow dr in dt.Rows )
        {
            Guid key = (Guid)dr[0];
            keys.Add(key);
            _objectChanges.Add(key);
        }

        if ( ExpirationFired != null )
            ExpirationFired(this, new DatabaseExpirationEventArgs(keys));
    }
}

The DatabaseExpirationEventArgs class looked like so:

public class DatabaseExpirationEventArgs : System.EventArgs
{
    public DatabaseExpirationEventArgs( List<Guid> expiredKeys )
    {
        _expiredKeys = expiredKeys;
    }

    private List<Guid> _expiredKeys;
    public List<Guid> ExpiredKeys
    {
        get  {  return _expiredKeys;  }
    }
}

In this database, all the primary keys were Guids, which made keeping track of changes substantially simpler. Each of the save methods in the middle tier would write its PK and the current datetime into a table. Each time the system polled the database, it stored the datetime (from the database, not from the middle tier) at which it initiated the poll, and GetExpirationsFromDb would return all items that had changed since that time. Another method would periodically remove rows that had long since been polled. This table of changes was very narrow: a Guid and a datetime (with a PK on both columns and the clustered index on the datetime, IIRC), so it could be queried very quickly. Also note that I used the Guid as the key in the cache.
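That change-tracking table and the polling query might look roughly like this (my reconstruction; the answer gives no actual DDL, so all names are hypothetical):

```sql
-- Hypothetical change-log table: one row per saved entity.
-- PK on both columns, clustered on the datetime, as described above.
CREATE TABLE dbo.ObjectChanges
(
    ObjectId  UNIQUEIDENTIFIER NOT NULL,
    ChangedAt DATETIME         NOT NULL,
    CONSTRAINT PK_ObjectChanges PRIMARY KEY CLUSTERED (ChangedAt, ObjectId)
);

-- What GetExpirationsFromDb might run: everything changed since the last poll.
SELECT ObjectId
FROM dbo.ObjectChanges
WHERE ChangedAt > @LastPollTime;
```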

The DatabaseExpirationWorker class was nearly identical to the BackgroundScheduler, except that its DoExpirationTimeoutExpired would call the DatabaseExpirationManager's UpdateExpirations method. Since none of the methods in BackgroundScheduler were virtual, I could not simply derive from it and override its methods.
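The polling wire-up could be sketched like this (my reconstruction; the real DatabaseExpirationWorker code is not shown in the answer):

```csharp
// Sketch: poll the database for expired keys on a fixed interval.
// DatabaseExpirationManager.UpdateExpirations is the method shown above.
public class DatabaseExpirationWorker : IDisposable
{
    private readonly System.Threading.Timer _timer;
    private readonly DatabaseExpirationManager _manager;

    public DatabaseExpirationWorker(DatabaseExpirationManager manager,
                                    TimeSpan pollInterval)
    {
        _manager = manager;
        // Fires UpdateExpirations on a thread-pool thread every interval.
        _timer = new System.Threading.Timer(
            delegate { _manager.UpdateExpirations(); },
            null, pollInterval, pollInterval);
    }

    public void Dispose()
    {
        _timer.Dispose();
    }
}
```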

The last thing I did was to write my own version of the EntLib's CacheManager that used my DatabaseExpirationWorker instead of the BackgroundScheduler; its indexer would check the object-expiration list:

private List<Guid> _objectExpirations = new List<Guid>();
private void OnExpirationFired( object sender, DatabaseExpirationEventArgs e )
{
    // Swap in the new list and purge under the same lock the indexer uses,
    // so readers never see a half-processed expiration list.
    lock ( _syncRoot )
    {
        _objectExpirations = e.ExpiredKeys;
        foreach ( Guid key in _objectExpirations )
            this.RealCache.Remove(key.ToString());
    }
}

private Microsoft.Practices.EnterpriseLibrary.Caching.CacheManager _realCache;
private Microsoft.Practices.EnterpriseLibrary.Caching.CacheManager RealCache
{
    get
    {
        lock ( _syncRoot )
        {
            if ( _realCache == null )
                _realCache = Microsoft.Practices.EnterpriseLibrary.Caching.CacheFactory.GetCacheManager();

            return _realCache;
        }
    }
}


public object this[Guid key]
{
    get
    {
        lock ( _syncRoot )
        {
            // Anything still pending expiration is treated as a cache miss.
            if ( _objectExpirations.Contains(key) )
                return null;
            return this.RealCache.GetData(key.ToString());
        }
    }
}

Again, it's been many moons since I reviewed this code, but it gives you the gist of it. Even looking through my old code, I see many places that could be cleaned up and cleared up. I also have not looked at the Caching Block in the most recent version of the EntLib, but I would imagine it has changed and improved. Keep in mind that in the system I built this for, there were dozens of changes per second, not hundreds; if the data was stale for a minute or two, that was acceptable. If your solution sees thousands of changes per second, this approach may not be feasible.

Thomas answered Oct 25 '22