
Efficiency of very large collections; iteration and sort

I have a csv parser that reads in 15+ million rows (with many duplicates), and once parsed into structs, need to be added to a collection. Each struct has properties Key (int), A(datetime), and B(int) (and others that aren't relevant here).

Requirement A: The collection needs to enforce uniqueness by a Key.

Requirement B: In a later step, I need the collection sorted by properties A(timestamp) then B(int).

Constraint: The structs eventually need to be traversed in order, one by one, with references to neighbors (a LinkedList presents the cleanest solution here); the point of this operation is to partition the set. Please assume that this is the earliest that partitioning can occur (ie, it cannot be partitioned at the parsing stage).

I've found that SortedSet<T> works quite well for Requirement A, and it's quite performant, even though its O(log n) insertions are much slower than HashSet<T>'s O(1) — and I don't actually care about sorting on the key. HashSet<T> gets bogged down when the collection gets huge, which apparently is a known issue, while SortedSet<T> does not suffer this drawback.

The problem: When I get to the step for Requirement B, sorting the collection (a SortedSet<T> passed to a method as IEnumerable<T>) takes a prohibitive amount of time (20+ minutes of grinding, all in-memory, no page file usage).

The question: Which collection(s) is(are) best suited to address this problem? One idea is to use two collections: one to enforce uniqueness (like a HashSet<int> or SortedSet<int> of keys), and a second SortedSet<T> to handle sorting at the parsing stage (ie, as far upstream as possible). But the application is already memory-intensive, and the performance penalty of hitting the pagefile is prohibitive.
What options does that leave me with for a single collection that enforces uniqueness by one characteristic, but sorts by other, unrelated characteristics? SortedSet<T> uses IComparer<T> (but not both IComparer<T> and IEquatable<T>), so if it relies on CompareTo to enforce uniqueness, then it doesn't seem to fit my requirements. Is subclassing SortedSet<T> the way to go?

Edit: The sort code:

SortedSet<Dto> parsedSet = {stuff};
var sortedLinkedStructs = new LinkedList<Dto>(parsedSet.OrderBy(t => t.Timestamp).ThenBy(i => i.SomeInt));

The struct:

public readonly struct Dto : IEquatable<Dto>, IComparer<Dto>, IComparable<Dto>
{
     public readonly DateTime Timestamp;
     public readonly int SomeInt;
     public readonly int Key;

     public Dto(DateTime ts, int someInt, int key)
     {
         Timestamp = ts;
         SomeInt = someInt;
         Key = key;
     }

     public bool Equals(Dto other) => this.Key == other.Key;
     public override int GetHashCode() => this.Key.GetHashCode();
     public int Compare(Dto x, Dto y) => x.Key.CompareTo(y.Key);
     public int CompareTo(Dto other) => this.Key.CompareTo(other.Key);
}
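One way to realize the two-collection idea from the question without keeping a tree sorted during parsing might be to dedupe by Key with a cheap HashSet<int> while appending to a contiguous List<Dto>, then perform a single O(n log n) sort at the end. This is a hedged sketch, not the asker's actual pipeline; the Dto shape and the Loader class are illustrative:

```csharp
using System;
using System.Collections.Generic;

public readonly struct Dto
{
    public readonly DateTime Timestamp;
    public readonly int SomeInt;
    public readonly int Key;
    public Dto(DateTime ts, int someInt, int key) { Timestamp = ts; SomeInt = someInt; Key = key; }
}

public static class Loader
{
    // Dedupe by Key during parsing, then sort the backing List<Dto> exactly once.
    public static LinkedList<Dto> Load(IEnumerable<Dto> parsed)
    {
        var seen = new HashSet<int>();   // uniqueness by Key only (small: one int per row)
        var items = new List<Dto>();     // contiguous storage; cheap to sort in place
        foreach (var dto in parsed)
            if (seen.Add(dto.Key))       // Add returns false for duplicates
                items.Add(dto);

        // One final sort by (Timestamp, SomeInt), instead of paying O(log n) per insert.
        items.Sort((a, b) =>
        {
            int c = a.Timestamp.CompareTo(b.Timestamp);
            return c != 0 ? c : a.SomeInt.CompareTo(b.SomeInt);
        });
        return new LinkedList<Dto>(items);
    }
}
```

List<T>.Sort sorts in place with no LINQ buffering, which avoids the extra copy that OrderBy(...).ThenBy(...) allocates before the LinkedList<T> is built.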
Asked Jan 19 '18 by Kevin Fichter

1 Answer

This might not be a direct answer, but it describes an approach I've used successfully for a similar system of similar scale: the "tag engine" that drives the question lists here on Stack Overflow. Essentially, I have a:

struct Question {
    // basic members - score, dates, id, etc - no text
}

and basically an oversized Question[] (actually I use a Question* in unmanaged memory, but that's because I need to be able to share it with some GPU code, for unrelated reasons). Populating the data is just filling successive rows in the Question[]. This data is never sorted; it is left alone as the source data, with just append (new key) or overwrite (same key) operations. At worst, we might need to reallocate and block-copy the data to a new array if we reach maximum capacity.

Now, instead of sorting that data, I separately keep an int[] (actually int* for the same reason as before, but... meh), where each value in the int[] is the index of the actual data in the Question[]. So initially it may be 0, 1, 2, 3, 4, 5, ... (although I pre-filter this, so it only contains the rows I want to keep - removing "deleted" etc).

I then sort that int[] using either a modified parallel quicksort (see http://stackoverflow.com/questions/1897458/parallel-sort-algorithm) or a modified "introspective sort" (like here) - so at the end of the sort, I might have 0, 3, 1, 5, ....

Now: to iterate through the data, I just iterate through the int[], and use that as a lookup to the actual data in the Question[]. This minimizes the amount of data movement during a sort, and allows me to keep multiple separate sorts (perhaps with different pre-filters) very efficiently. It takes milliseconds only to sort the 15M data (which happens every minute or so to bring in new questions into Stack Overflow, or to note changes to existing questions).
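A minimal sketch of the index-vector idea, using managed arrays rather than the unmanaged memory the answer describes (the Question fields and the descending-score order here are illustrative assumptions):

```csharp
using System;

// The data array is never moved; only a small int[] of row indices is sorted.
struct Question
{
    public int Score;
    public int Id;
}

static class IndexSortDemo
{
    public static int[] SortedIndices(Question[] data)
    {
        var indices = new int[data.Length];
        for (int i = 0; i < indices.Length; i++)
            indices[i] = i;                      // initially 0, 1, 2, ...

        // Sort the indices by comparing the rows they point at; 'data' stays put.
        Array.Sort(indices, (x, y) => data[y].Score.CompareTo(data[x].Score)); // descending score
        return indices;
    }
}
```

Because only 4-byte indices move during the sort, the large Question rows are never copied, and several differently-sorted (or differently-filtered) index vectors can coexist over the same data.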

To make the sort as fast as possible, I try to write my sort code such that a composite sort can be represented by a single integer value, allowing a very efficient sort (usable by the introspective sort). For example, here's the code for the "last activity date, then question id" sort:

public override bool SupportsNaturallySortableUInt64 => true;
public override unsafe ulong GetNaturallySortableUInt64(Question* question)
{
    // compose the date (MSB) and ID (LSB)
    var val = Promote(question->LastActivityDate) << 32
        | Promote(question->Id);
    return ~val; // the same as ulong.MaxValue - val (which reverses order) but much cheaper
}

This works by treating the LastActivityDate as a 32-bit integer, left shifting by 32 bits and composing it with the Id as a 32-bit integer, meaning we can compare the date and the id in a single operation.

Or for "score, then answer score, then id":

public override unsafe ulong GetNaturallySortableUInt64(Question* question)
{
    // compose the data
    var val = Promote(question->Score) << 48
        | Promote(question->AnswerScore) << 32
        | Promote(question->Id);
    return ~val; // the same as ulong.MaxValue - val (which reverses order) but much cheaper
}

Note that GetNaturallySortableUInt64 is only called once per element - into a working area of a ulong[] (yes, actually a ulong*) of the same size, so initially the two workspaces are something like:

int[]    ulong[]
0        34243478238974
1        12319388173
2        2349245938453
...      ...

Now I can do the entire sort by looking just at an int[] and a ulong[], such that the ulong[] vector ends up in the sorted order, and the int[] contains the indices of the items to look at.
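The paired int[]/ulong[] workspaces map naturally onto the Array.Sort<TKey, TValue>(keys, items) overload, which reorders both arrays by the keys. Here is a hedged sketch; the answer's Promote helper isn't shown, so the stand-in below simply assumes non-negative 32-bit values:

```csharp
using System;

static class PackedSortDemo
{
    // Simplified stand-in for the answer's Promote: widen a non-negative int
    // to ulong so the shifts below don't sign-extend.
    static ulong Promote(int value) => (uint)value;

    // Pack (date, id) into one ulong key per row, then sort the key array and
    // the index array together in a single pass.
    public static void Sort(int[] dates, int[] ids, int[] indices)
    {
        var keys = new ulong[indices.Length];
        for (int i = 0; i < indices.Length; i++)
        {
            int row = indices[i];
            ulong val = Promote(dates[row]) << 32 | Promote(ids[row]);
            keys[i] = ~val; // bitwise NOT reverses the order: newest first
        }
        // Reorders 'indices' so it follows the ascending order of 'keys'.
        Array.Sort(keys, indices);
    }
}
```

The single ulong comparison replaces a two-field Comparison<T> callback, which is what makes the tight introspective-sort loop in the answer so fast.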

Answered Nov 09 '22 by Marc Gravell