
Optimal storage of data structure for fast lookup and persistence

Scenario

I have the following methods:

public void AddItemSecurity(int itemId, int[] userIds)
public int[] GetValidItemIds(int userId)

Initially I'm thinking of storage of the form:

itemId -> userId, userId, userId

and

userId -> itemId, itemId, itemId

AddItemSecurity reflects how I get data from a third-party API; GetValidItemIds is how I want to use it at runtime.

There are potentially 2000 users and 10 million items. Item ids are of the form 2007123456, 2010001234 (10 digits where the first four represent the year).

AddItemSecurity does not have to perform super fast, but GetValidItemIds needs to be sub-second. Also, if there is an update on an existing itemId I need to remove that itemId for users no longer in the list.
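
To make that update rule concrete, here is a minimal sketch of what an AddItemSecurity call could do when an item already has a stored list; GetStoredUserIds, RemoveItemForUser and StoreUserIds are hypothetical helpers, not part of the actual design:

    // Hedged sketch (requires using System.Linq): when a new user list arrives
    // from the third-party API, figure out which users lost access to the item.
    public void AddItemSecurity(int itemId, int[] userIds)
    {
        int[] oldUserIds = GetStoredUserIds(itemId);            // hypothetical: current list on disk
        int[] removedUserIds = oldUserIds.Except(userIds).ToArray();

        foreach (int userId in removedUserIds)
            RemoveItemForUser(userId, itemId);                  // hypothetical: clear that user's entry/bit

        StoreUserIds(itemId, userIds);                          // hypothetical: persist the new list
    }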

I'm trying to think about how I should store this in an optimal fashion. Preferably on disk (with caching), but I want the code maintainable and clean.

If the item ids had started at 0, I thought about creating a byte array of length MaxItemId / 8 for each user, and setting a true/false bit if the item was present or not. That would limit the array length to a little over 1 MB per user and give fast lookups as well as an easy way to update the list per user. By persisting this as memory-mapped files with the .NET 4 framework I think I would get decent caching as well (if the machine has enough RAM) without implementing caching logic myself. Parsing the id, stripping out the year, and storing an array per year could be a solution.
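
A minimal sketch of that per-user, per-year bitmap over a memory-mapped file; the path layout and the 125,000-byte size (1,000,000 sequence numbers / 8 bits per byte) are my assumptions for illustration, not the author's code:

    using System.IO;
    using System.IO.MemoryMappedFiles;

    // Sketch: mark an item as visible to a user by setting one bit in a
    // per-user, per-year bitmap file. The path layout is a made-up example.
    static void SetItemForUser(int userId, int itemId)
    {
        int year = itemId / 1000000;     // first four digits, e.g. 2007
        int seq  = itemId % 1000000;     // remaining six digits, e.g. 123456
        string path = string.Format(@"C:\acl\user{0}_{1}.bin", userId, year);

        using (var mmf = MemoryMappedFile.CreateFromFile(
            path, FileMode.OpenOrCreate, null, 125000))
        using (var view = mmf.CreateViewAccessor())
        {
            long byteOffset = seq / 8;
            byte mask = (byte)(1 << (seq % 8));
            byte current = view.ReadByte(byteOffset);
            view.Write(byteOffset, (byte)(current | mask));
        }
    }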

The ItemId -> UserId[] list can be serialized directly to disk and read/written with a normal FileStream in order to persist the list and diff it when there are changes.
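
For example, a rough sketch of persisting that map with a plain FileStream and BinaryWriter; the file format here is my own, not the author's:

    using System.Collections.Generic;
    using System.IO;

    // Sketch: write the itemId -> userId[] map to disk so it can be re-read
    // and diffed against the next import from the third-party API.
    static void SaveItemSecurity(string path, Dictionary<int, int[]> itemUsers)
    {
        using (var fs = new FileStream(path, FileMode.Create, FileAccess.Write))
        using (var writer = new BinaryWriter(fs))
        {
            writer.Write(itemUsers.Count);
            foreach (var pair in itemUsers)
            {
                writer.Write(pair.Key);              // itemId
                writer.Write(pair.Value.Length);     // number of users
                foreach (int userId in pair.Value)
                    writer.Write(userId);
            }
        }
    }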

Each time a new user is added, all the lists have to be updated as well, but this can be done nightly.

Question

Should I continue to try out this approach, or are there other paths which should be explored as well? I'm thinking SQL Server will not perform fast enough, and it would add overhead (at least if it's hosted on a different server), but my assumptions might be wrong. Any thoughts or insights on the matter are appreciated. And I want to try to solve it without adding too much hardware :)

[Update 2010-03-31]

I have now tested with SQL Server 2008 under the following conditions.

  • Table with two columns (userid, itemid), both int
  • Clustered index on the two columns
  • Added ~800,000 items for 180 users - a total of 144 million rows
  • Allocated 4 GB RAM for SQL Server
  • Dual-core 2.66 GHz laptop
  • SSD disk
  • Used a SqlDataReader to read all itemids into a List (see the sketch after this list)
  • Looped over all users
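
A minimal sketch of the lookup used in this test; the connection string, table and column names are assumptions rather than the exact benchmark code:

    using System.Collections.Generic;
    using System.Data.SqlClient;

    // Sketch: read all item ids for one user with a SqlDataReader into a List.
    static List<int> GetValidItemIds(string connectionString, int userId)
    {
        var itemIds = new List<int>();
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(
            "SELECT itemid FROM dbo.UserItems WHERE userid = @userid", conn))
        {
            cmd.Parameters.AddWithValue("@userid", userId);
            conn.Open();
            using (SqlDataReader reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    itemIds.Add(reader.GetInt32(0));
            }
        }
        return itemIds;
    }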

If I run one thread it averages 0.2 seconds. When I add a second thread it goes up to 0.4 seconds, which is still OK. From there on the results degrade: a third thread brings a lot of the queries up to 2 seconds, a fourth thread up to 4 seconds, and a fifth spikes some of the queries up to 50 seconds.

The CPU is maxed out while this is going on, even with one thread. My test app accounts for some of that due to the tight loop, and SQL Server for the rest.

Which leads me to the conclusion that it won't scale very well, at least not on the hardware I tested. Are there ways to optimize the database, say by storing an array of ints per user instead of one record per item? But that makes it harder to remove items.

[Update 2010-03-31 #2]

I did a quick test with the same data, storing it as bits in memory-mapped files. It performs much better. Six threads yield access times between 0.02 s and 0.06 s, purely memory bound. The mapped files were mapped by one process and accessed by six others simultaneously. And where the SQL data took 4 GB, the files on disk took 23 MB.
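
A sketch of how that one-writer/many-readers setup can be wired up with named maps; the file path, map name, and size are assumptions for illustration:

    using System.IO;
    using System.IO.MemoryMappedFiles;

    class SharedBitmapSketch
    {
        // In the owning process: create a named map over the bitmap file and
        // keep the returned handle alive for as long as readers need it.
        public static MemoryMappedFile CreateMap()
        {
            return MemoryMappedFile.CreateFromFile(
                @"C:\acl\user42_2010.bin", FileMode.OpenOrCreate,
                "acl_user42_2010", 125000);
        }

        // In each reader process: attach to the map by name, read-only, and
        // test a single bit (one byte read plus a mask check).
        public static bool ReadBit(int seq)
        {
            using (var mmf = MemoryMappedFile.OpenExisting(
                "acl_user42_2010", MemoryMappedFileRights.Read))
            using (var view = mmf.CreateViewAccessor(0, 0, MemoryMappedFileAccess.Read))
            {
                byte b = view.ReadByte(seq / 8);
                return (b & (1 << (seq % 8))) != 0;
            }
        }
    }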

Mikael Svenson asked Mar 30 '10

2 Answers

After much testing I ended up using Memory Mapped Files, marking them with the sparse bit (NTFS), using code from NTFS Sparse Files with C#.
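
For reference, a rough sketch of the sparse-file part: marking the file sparse through DeviceIoControl with the FSCTL_SET_SPARSE control code. The linked article contains the full, tested version; treat this as an outline only:

    using System;
    using System.ComponentModel;
    using System.IO;
    using System.Runtime.InteropServices;
    using Microsoft.Win32.SafeHandles;

    static class SparseFileSketch
    {
        const uint FSCTL_SET_SPARSE = 0x000900C4;

        [DllImport("kernel32.dll", SetLastError = true)]
        static extern bool DeviceIoControl(
            SafeFileHandle hDevice, uint dwIoControlCode,
            IntPtr lpInBuffer, uint nInBufferSize,
            IntPtr lpOutBuffer, uint nOutBufferSize,
            out uint lpBytesReturned, IntPtr lpOverlapped);

        // Mark an already-open file as sparse so unwritten regions stay unallocated.
        public static void MarkAsSparse(FileStream file)
        {
            uint bytesReturned;
            if (!DeviceIoControl(file.SafeFileHandle, FSCTL_SET_SPARSE,
                IntPtr.Zero, 0, IntPtr.Zero, 0, out bytesReturned, IntPtr.Zero))
            {
                throw new Win32Exception(Marshal.GetLastWin32Error());
            }
        }
    }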

Wikipedia has an explanation of what a sparse file is.

The benefit of using a sparse file is that I don't have to care about what range my ids are in. If I only write ids between 2006000000 and 2010999999, the file will only allocate 625,000 bytes from offset 250,750,000 in the file. All space up to that offset is unallocated in the file system. Each id is stored as a set bit in the file, sort of treated as a bit array. And if the id sequence suddenly changes, it will allocate another part of the file.

In order to retrieve which ids are set, I can perform an OS call to get the allocated parts of the sparse file, and then check each bit in those sequences. Also, checking if a particular id is set is very fast. If it falls outside the allocated blocks, it's not there; if it falls within, it's merely one byte read and a bit-mask check to see if the correct bit is set.
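
A sketch of that single-id check, assuming one sparse bitmap file per user; the real code can first query the allocated ranges as described above, while this simplification just reads the byte (holes in a sparse file read back as zeros):

    using System.IO;

    // Sketch: is itemId set for this user? One byte read plus a bit-mask check.
    static bool IsItemSet(string userBitmapPath, int itemId)
    {
        long byteOffset = itemId / 8;
        int mask = 1 << (itemId % 8);

        using (var fs = new FileStream(userBitmapPath, FileMode.Open, FileAccess.Read))
        {
            if (byteOffset >= fs.Length)
                return false;                      // beyond anything ever written
            fs.Seek(byteOffset, SeekOrigin.Begin);
            return (fs.ReadByte() & mask) != 0;
        }
    }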

So for the particular scenario where you have many ids which you want to check with as much speed as possible, this is the best approach I've found so far.

And the good part is that the memory-mapped files can be shared with Java as well (which turned out to be something we needed). Java also has support for memory-mapped files on Windows, and implementing the read/write logic is fairly trivial.

Answered by Mikael Svenson


I really think you should try a nice database before you make your decision. Something like this will be a challenge to maintain in the long run. Your user base is actually quite small. SQL Server should be able to handle what you need without any problems.

Answered by ChaosPandion