
Most performant way to filter a huge List in C#?

We have an ASP.NET MVC web application that is hooked up to a SQL Server DB via Entity Framework. One of the main tasks of this application is to allow users to quickly search and filter a huge database table that holds archive values.

The table structure is quite simple: Timestamp (DateTime), StationId (int), DatapointId (int), Value (double). This table holds somewhere between 10 and 100 million rows. I optimized the DB table with a covering index etc., but the user experience was still quite laggy when filtering by DatapointId, StationId and time, then skipping and taking only the portion I want to show on the page.

So I tried a different approach: since our server has a lot of RAM, I thought we could simply load the whole archive table into a List&lt;ArchiveRow&gt; when the web app starts up, then get the data directly from this List instead of doing a round-trip to the database. This works quite well: it takes about 9 seconds to load the whole archive table (currently about 10 million entries) into the List. The ArchiveResponse is a simple object that looks like this:

public class ArchiveResponse {
    public int Length { get; set; }
    public int numShown { get; set; }
    public int numFound { get; set; }
    public int numTotal { get; set; }
    public List<ArchiveRow> Rows { get; set; }
}

and the ArchiveRow accordingly:

public class ArchiveRow {
    public int s { get; set; }
    public int d { get; set; }
    public DateTime t { get; set; }
    public double v { get; set; }
}
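For completeness, the cache itself is filled once at startup with a single read-only query. A minimal sketch of that load, using the ArchiveCacheValue element type that appears in the filtering code further below (Db and ArchiveValues are placeholder names, not the actual ones):

// Minimal sketch of the startup load; Db and ArchiveValues are placeholder names.
// AsNoTracking() skips EF change tracking, which matters for millions of rows.
ArchiveCache = Db.ArchiveValues
    .AsNoTracking()
    .Select(x => new ArchiveCacheValue {
        StationId = x.StationId,
        DataPointId = x.DatapointId,
        DateValue = x.Timestamp,
        Value = x.Value
    })
    .ToList();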

When I now try to get the desired data from the List with a LINQ query, it is already faster than querying the DB, but it is still quite slow when filtering by multiple criteria. For example, when I filter by one StationId and 12 DatapointIds, it takes about 5 seconds to retrieve a window of 25 rows. I already switched from filtering with Where to using joins, but I think there is still room for improvement. Are there maybe better ways to implement such a caching mechanism while keeping memory consumption as low as possible? Are there other collection types that are better suited for this purpose?

So here is the code that filters and fetches the relevant data from the ArchiveCache list:

// Total number of entries in archive cache
var numTotal = ArchiveCache.Count();

// Initial Linq query
ParallelQuery<ArchiveCacheValue> query = ArchiveCache.AsParallel();

// The request may contain StationIds that the user is interested in,
// so here's the filtering by StationIds with a join:
if (request.StationIds.Count > 0)
{
    query = from a in query
            join b in request.StationIds.AsParallel()
            on a.StationId equals b
            select a;
}

// The request may contain DatapointIds that the user is interested in,
// so here's the filtering by DatapointIds with a join:
if (request.DatapointIds.Count > 0)
{
    query = from a in query
            join b in request.DatapointIds.AsParallel()
            on a.DataPointId equals b
            select a;
}

// Number of matching entries after filtering and before windowing
int numFound = query.Count();

// Pagination: Select only the current window that needs to be shown on the page
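// (Start appears to be 1-based here: Start == 0 and Start == 1 both select the first row)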
var result = query.Skip(request.Start == 0 ? 0 : request.Start - 1).Take(request.Length);

// Number of entries on the current page that will be shown
int numShown = result.Count();

// Build a response object, serialize it to Json and return to client
// Note: The projection with the Rows is not a bottleneck, it is only done to
// shorten 'StationId' to 's' etc. At this point there are only 25 to 50 rows,
// so that is no problem and happens in way less than 1 ms
ArchiveResponse myResponse = new ArchiveResponse();
myResponse.Length = request.Length;
myResponse.numShown = numShown;
myResponse.numFound = numFound;
myResponse.numTotal = numTotal;
myResponse.Rows = result.Select(x => new ArchiveRow() { s = x.StationId, d = x.DataPointId, t = x.DateValue, v = x.Value }).ToList();

return JsonSerializer.ToJsonString(myResponse);
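One detail I am aware of: query is enumerated more than once above (once for numFound, once for numShown and once more for the final projection). Materializing the current page a single time would at least avoid re-enumerating the window; a small sketch of what I mean:

// Materialize the current page once, then reuse the list for the count and the projection.
var page = query.Skip(request.Start == 0 ? 0 : request.Start - 1)
                .Take(request.Length)
                .ToList();
int numShown = page.Count; // no second enumeration of the parallel query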

Some more details: the number of stations is usually somewhere between 5 and 50, rarely more than 50. The number of datapoints is < 7000. The web application is set to 64-bit with <gcAllowVeryLargeObjects enabled="true" /> set in the web.config.

I'm really looking forward to further improvements and recommendations. Maybe there's a completely different approach based on arrays or similar that performs way better without LINQ?

asked Nov 17 '17 by Rob

1 Answer

You can adjust your storage to this specific query type. First, create dictionaries from your in-memory archive:

ArchiveCacheByDatapoint = ArchiveCache.GroupBy(c => c.DataPointId)
            .ToDictionary(c => c.Key, c => c.ToList());
ArchiveCacheByStation = ArchiveCache.GroupBy(c => c.StationId)
            .ToDictionary(c => c.Key, c => c.ToList());
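Both lookups simply map an id to the list of all cached rows carrying that id, so the field declarations would look roughly like this (a sketch; the names match the snippets above):

Dictionary<int, List<ArchiveCacheValue>> ArchiveCacheByDatapoint;
Dictionary<int, List<ArchiveCacheValue>> ArchiveCacheByStation;

Note that the dictionaries only store references to the existing ArchiveCacheValue instances, so the extra memory cost is roughly two references per row, not a copy of the data.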

Then use those dictionaries in your query:

bool hasStations = request.StationIds.Count > 0;
bool hasDatapoints = request.DatapointIds.Count > 0;
int numFound = 0;
List<ArchiveCacheValue> result;
if (hasDatapoints && hasStations) {
    // special case - filter by both
    result = new List<ArchiveCacheValue>();
    // store station filter in hash set
    var stationsFilter = new HashSet<int>(request.StationIds);
    // first filter by datapoints, because you have more different datapoints than stations
    foreach (var datapointId in request.DatapointIds.OrderBy(c => c)) {                    
        foreach (var item in ArchiveCacheByDatapoint[datapointId]) {                        
            if (stationsFilter.Contains(item.StationId)) {
                // both datapoint and station matches filter - found item
                numFound++;
                if (numFound >= request.Start && result.Count < request.Length) {
                    // add to result list if matches paging criteria
                    result.Add(item);
                }
            }
        }
    }
}
else if (hasDatapoints) {                
    var query = Enumerable.Empty<ArchiveCacheValue>();                
    foreach (var datapoint in request.DatapointIds.OrderBy(c => c))
    {
        var list = ArchiveCacheByDatapoint[datapoint];
        numFound += list.Count;
        query = query.Concat(list);
    }
    // execute query just once
    result = query.Skip(request.Start).Take(request.Length).ToList();
}
else if (hasStations) {                
    var query = Enumerable.Empty<ArchiveCacheValue>();
    foreach (var station in request.StationIds.OrderBy(c => c))
    {
        var list = ArchiveCacheByStation[station];
        numFound += list.Count;
        query = query.Concat(list);
    }
    // execute query just once
    result = query.Skip(request.Start).Take(request.Length).ToList();
}
else {
    // no need to do Count()
    numFound = ArchiveCache.Count;
    // no need to Skip/Take here really; ArchiveCache is a list/array,
    // so you can use indexes, which will be faster
    result = ArchiveCache.Skip(request.Start).Take(request.Length).ToList();
}

// Number of entries on the current page that will be shown
int numShown = result.Count;
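One caveat with the sketch above: the indexer lookups (ArchiveCacheByDatapoint[datapointId] and ArchiveCacheByStation[station]) throw a KeyNotFoundException if a requested id has no rows in the cache. If requests can contain such ids, a TryGetValue guard is safer, for example:

// Skip ids that have no rows in the cache instead of throwing.
if (ArchiveCacheByDatapoint.TryGetValue(datapointId, out var itemsForDatapoint))
{
    foreach (var item in itemsForDatapoint)
    {
        // ... same filtering and paging logic as above ...
    }
}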

I've measured it, and on my machine it runs in 1 ms (sometimes up to 10 ms) for all the query types I tried (by stations only, by datapoints only, by both stations and datapoints), for 100 million items.

answered Oct 06 '22 by Evk