My data structures knowledge is way rusty, and to be honest it was never my strongest point.
Right now we're about to build a queue-like component that has the following requirements:
I think that sums it up. So, obviously a single list or sorted list is out of the question: every time we add or remove objects, the collection has to be sorted again, and doing that in a single collection with a million objects is slow.
We tested a couple of approaches in the past, like linked lists, which proved to be fast for queueing, but slow for finding items (since we do have that scenario).
Right now we've come up with a structure like
SortedDictionary<int, SortedDictionary<int, SortedDictionary<int, SortedDictionary<int, SortedDictionary<int, ..
You get the idea.
It's kind of a sweet spot of grouping levels, keeping each collection relatively small (around 300 items per dictionary).
So, for the first level we will have a SortedDictionary whose keys are the IDs of the master categories, and whose values will each be a SortedDictionary keyed by the ID of the child category... and so on.
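To make that more concrete, here's a stripped-down, two-level sketch of the idea in C# (in reality there are more nesting levels, and QueueItem is just a stand-in for our actual item class):

using System.Collections.Generic;

public class QueueItem
{
    public int Id { get; set; }
}

public static class NestedQueueSketch
{
    // Level 1: master category id -> (level 2: child category id -> items keyed by item id)
    private static readonly SortedDictionary<int, SortedDictionary<int, SortedDictionary<int, QueueItem>>> root =
        new SortedDictionary<int, SortedDictionary<int, SortedDictionary<int, QueueItem>>>();

    public static void Add(int masterId, int childId, QueueItem item)
    {
        if (!root.TryGetValue(masterId, out var children))
        {
            children = new SortedDictionary<int, SortedDictionary<int, QueueItem>>();
            root.Add(masterId, children);
        }
        if (!children.TryGetValue(childId, out var items))
        {
            items = new SortedDictionary<int, QueueItem>();
            children.Add(childId, items);
        }
        items[item.Id] = item; // each leaf dictionary stays relatively small (~300 items)
    }

    public static bool TryFind(int masterId, int childId, int itemId, out QueueItem item)
    {
        item = null;
        return root.TryGetValue(masterId, out var children)
            && children.TryGetValue(childId, out var items)
            && items.TryGetValue(itemId, out item);
    }
}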
Right now we've tested with 100, 1,000, 10,000, 100,000 and 1,000,000 items.
For the smaller ranges, up to 100k, the solution is fast. It can queue/dequeue/find in less than a second even for up to 300k, which covers well over 80-90% of the scenarios we will handle.
When it comes to a million, it does become slower, taking about 3-4 seconds to queue the whole thing, and up to 10 seconds to deplete the queue.
So, my questions are:
Thanks.
A few remarks and general suggestions (I'm sorry that I don't have the time to test this myself):
I believe that your general approach - nested (sorted)dictionaries - is sound. I use similar structures very often, to my satisfaction - not for performance reasons because my cases are always small, but for clarity and flexibility.
In your case it also addresses one of the performance issues, because sorting (at inserting and at deleting) takes time and smaller (sub)dictionaries obviously sort faster.
Yes, having class-instances as values (or keys) only uses the reference, so in that respect it doesn't really matter what size or structure your class has.
The increasing time for deleting (and adding) presumably is caused (primarily) by the resorting that is done every time you remove (or add) an item.
As regards the performance of adding items:
If your case enables you to feed the dictionaries with items in a sorted (ascending) order, you may want to switch to a SortedLIST, rather than a SortedDICTIONARY, because in the list adding items is O(1) rather than O(log n) if the new items will wind up at the end of the sorted collection.
A collection has an initial CAPACITY - the reserved space for items. Adding items beyond the current capacity of the collection results in (a) increasing the capacity-value of the collection; (b) reallocating space for the (whole) new capacity; (c) copying the values from the original (small) storage to the new one. Therefore, if you have some idea about how large your collections are going to be: initialize the collection with the appropriate capacity. This is not (yet) possible with a SortedDICTIONARY, but the SortedLIST supports that feature.
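In C# terms (which I gather is what you're using), a minimal, untested sketch of both points - pre-sizing the list and feeding it keys in ascending order - could look like this; the size and the value type are just placeholders:

using System;
using System.Collections.Generic;

class SortedListSketch
{
    static void Main()
    {
        // Reserve the expected capacity up front to avoid repeated reallocations.
        var list = new SortedList<int, string>(100000);

        for (int key = 0; key < 100000; key++)
        {
            // Keys arrive already ascending, so each Add appends at the end
            // of the internal arrays (roughly O(1)) instead of shifting entries.
            list.Add(key, "item " + key);
        }

        Console.WriteLine(list.Count);
    }
}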
As regards the performance of deleting items:
You may want to consider an approach that uses a custom class wrapping the sorted collection, such that it doesn't "really" delete items (thereby avoiding the re-sorting) until the whole collection is empty.
All in all, have a try with something along the following lines (using VB syntax; I'm sure you can translate it to C#, C++ or whatever language you wish to use).
Public Class MySortedCollection(Of myKeyType, myValueType)

    Public Sub New(Optional myCapacity As Integer = 0)
        If myCapacity <= 0 Then
            MyValues = New SortedList(Of myKeyType, myValueType)
            ValidItems = New Dictionary(Of myKeyType, Boolean)
        Else
            MyValues = New SortedList(Of myKeyType, myValueType)(myCapacity)
            ValidItems = New Dictionary(Of myKeyType, Boolean)(myCapacity)
        End If
        LiveItemsCount = 0
    End Sub

    ' The sorted items, including those already marked as deleted.
    Private MyValues As SortedList(Of myKeyType, myValueType)
    ' Tracks which keys are still "live" (True) versus lazily deleted (False).
    Private ValidItems As Dictionary(Of myKeyType, Boolean)
    Private LiveItemsCount As Integer

    Public ReadOnly Property Count As Integer
        Get
            Return LiveItemsCount
        End Get
    End Property

    Public Sub Clear()
        MyValues.Clear()
        ValidItems.Clear()
        LiveItemsCount = 0
    End Sub

    Public Sub Add(key As myKeyType, value As myValueType)
        MyValues.Add(key, value)
        ValidItems.Add(key, True)
        LiveItemsCount += 1
    End Sub

    Public Function Remove(key As myKeyType) As Integer
        ' Only mark the item as deleted; the underlying list is not resorted.
        If ValidItems.ContainsKey(key) AndAlso ValidItems(key) Then
            ValidItems(key) = False
            LiveItemsCount -= 1
            If LiveItemsCount = 0 Then
                MyValues.Clear()
                ValidItems.Clear()
            End If
        End If
        Return LiveItemsCount
    End Function

    Public Function TryGetValue(key As myKeyType, ByRef value As myValueType) As Boolean
        If MyValues.TryGetValue(key, value) Then
            If ValidItems(key) Then
                Return True
            Else
                value = Nothing
            End If
        End If
        Return False
    End Function

    Public Function TryGetAndDelete(key As myKeyType, ByRef value As myValueType) As Boolean
        If Me.TryGetValue(key, value) Then
            ValidItems(key) = False
            LiveItemsCount -= 1
            If LiveItemsCount = 0 Then
                MyValues.Clear()
                ValidItems.Clear()
            End If
            Return True
        End If
        Return False
    End Function

    ' Add more collection-wrapping methods as needed.
End Class
You "pay" performance-wise for the overhead of the wrapping class as well as for the auxiliary dictionary that is used inside to keep track of the "real" items versus those that are considered deleted. You save however on the repeated sorting upon deleting an item. It depends of course on the actual case whether it will help (or harm...). And again, I haven't tested it myself.