I'm working on an application that runs on Windows Mobile 6 that needs to be able to retrieve all items from an item table that contain a given string (provided by the end user) within the item's description field. The problem is that there are approximately 170,000 items in the table. Since I need to return all items that contain the string anywhere within the description, I'm forced to use a LIKE '%string%', which eliminates any chance of using the index. The data and table structure are originally based on a Progress database, which has a wonderful contains operator on any word-indexed fields. This is not the case in our mobile application, since it uses SQL Server Compact 3.5.
Basically, my DAL runs the query and retrieves an SqlCeDataReader and then uses an ItemFactory to create a List object that contains only the matched items. This obviously lets us keep our domain/business objects separate from the data access layer.
Fine and dandy, except for the 8m and 42s that it takes to retrieve items when I search for all items that contain something like "golf" in the description. Obviously this is not an acceptable time frame for the end user.
My first attempt was to instead retrieve all items from the database using SELECT * FROM Item (with an ORDER BY clause on one of the main indexed fields). I then ran an IndexOf check as I iterated through the SqlCeDataReader and had the ItemFactory only add items to the List object if they contained the requested description text. This improved the speed down to 1m 46s. Not too shabby, but still too slow.
I then tried another approach that showed promise... almost... While the application starts up, I tried creating a List that contained all item objects within the database (takes about 2 minutes to run the query and populate the entire list, but at least it's only once as the app is initializing... still... ugh). Once the list is complete, I can easily run queries on that list doing things such as the following (I hope my syntax is right... I'm not at work right now and I don't have Visual Studio on the pc I'm sitting at):
List<Item> specificItems =
AllItems.FindAll(i => i.Description.IndexOf(searchString, StringComparison.OrdinalIgnoreCase) >= 0);
This approach knocked it down to 21s. Very nice (still slow though in the grand scheme of things). However, the problem is that the memory usage is way too great if I load all the items from the database. I actually had to cut off the last 20,000 items (so the 21s time frame probably would have been more like 25s) during the initial load, because an OutOfMemoryException was being thrown. According to the memory manager on the emulator, I still had about 20 MB of free RAM, but I've heard that a process can only have 32 MB of RAM associated with it (not sure if that's true for WM 6, but it appears so).
To make sure the problem wasn't the List object itself (which I was instantiating with the needed capacity in its constructor to avoid dynamic resizing), since I've read that its implicit EnsureCapacity calls can cause extra memory usage, I tried using an Item[] array (sized ahead of time). This still had the memory issue, and the size difference was negligible.
Ok, enough rambling. I know I'm likely going to have to somehow limit the records returned from the database by the datareader (through some indexed search on a different type of field) and then use IndexOf on that smaller subset of items to get maximum performance (thus skipping the LIKE operator altogether). This will force the end user to enter more than just a description search, though (perhaps item hierarchy information to limit which type of items to search within).
Any ideas? Am I going about this the wrong way?
Thanks for listening (sorry this post is long, I'm kind of thinking out loud).
Oh I should add (just in summary) what I'm using:
UPDATE: While the Bloom Filter approach mentioned below seemed interesting, it could not fulfill one requirement (which I didn't really specify above): I need to match words that are contained inside other words (e.g. "club" should also return "clubs"). Because of this, I was forced to use a different approach altogether (Kent Fredric... thanks for pointing this out). I've marked Kent's answer as the correct one, since his approach was the one that filled the most requirements (Mitch, yours had a similar issue to the Bloom filter suggested by Jaunder). However, I have gone a different approach (for now...) than his way as well.
What I've done is pull all item objects into memory with only item numbers and descriptions, which keeps it under the memory limitations (it does still cause a longer initialization than I'd like... multithreading and loading that information behind the scenes while the application is running could take care of that, I guess). To perform the searches I've written my own contains routine. The routine is written in unsafe C# code that uses two pointers and a couple of loops to run through the description and the required matching text. If it finds a match anywhere in the description, it adds the item number to an array. Once all items have been searched, a new query goes back to the database and grabs only the matching item numbers (which is very fast due to the index on an integer field). Those items are then created in a List with all their information (not just the item number and description). The whole operation takes approximately 5 - 10 seconds (depending on the description), which is good enough for now.
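The pointer-based routine might look something like the sketch below. This is my reconstruction of the idea rather than the exact code (which Dolch helped write); the method name is hypothetical, it needs to be compiled with /unsafe, and it assumes char.ToUpperInvariant is available on the target framework:

```csharp
using System;

static class FastSearch
{
    // Case-insensitive ordinal substring test using raw char pointers,
    // avoiding per-comparison string allocations.
    public static unsafe bool ContainsIgnoreCase(string haystack, string needle)
    {
        if (needle.Length == 0) return true;
        if (haystack.Length < needle.Length) return false;

        fixed (char* pHay = haystack)
        fixed (char* pNeedle = needle)
        {
            // Stop once fewer characters remain than the needle requires.
            int last = haystack.Length - needle.Length;
            for (int i = 0; i <= last; i++)
            {
                int j = 0;
                while (j < needle.Length &&
                       char.ToUpperInvariant(pHay[i + j]) ==
                       char.ToUpperInvariant(pNeedle[j]))
                {
                    j++;
                }
                if (j == needle.Length) return true;
            }
            return false;
        }
    }
}
```

Each matching item's number would then be collected and fed into the second, index-backed query described above.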
I'll still look into optimizing this further (I might be able to track how many characters long the search term is... if there are fewer characters remaining in the item description than the required text, the loop could skip straight to the next item).
Any suggestions are still welcome. For now I have marked Kent's answer as "the most correct" for my question.
Props to Dolch for helping me write the contains routine.
How about pre-processing the items table once (each new entry added to it would also need to be processed), to create a word-occurrence table like:
CREATE TABLE WordItemOccurance
(
    [Word] varchar(50) not null,
    ItemId int not null
        constraint FK_Items references ItemTable(ID)
)
Iterate over all your items, break each description into separate words, and add entries to the occurrence table as they are found.
Creating a clustered index on [Word] and joining to the Item table on ItemId should be fast.
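The iteration step could be sketched as follows. This is a minimal sketch: the tokenization rules and helper name are my own assumptions, and the actual INSERTs into WordItemOccurance would go through a SqlCeCommand on the device:

```csharp
using System;
using System.Collections.Generic;

static class WordIndexer
{
    // Break a description into distinct lowercase words for the
    // occurrence table; punctuation is treated as a separator.
    public static List<string> ExtractWords(string description)
    {
        var words = new List<string>();
        var seen = new Dictionary<string, bool>();  // CF 3.5 has no HashSet<T>
        foreach (string raw in description.Split(
                     new[] { ' ', ',', '.', ';', ':', '-', '/', '(', ')' },
                     StringSplitOptions.RemoveEmptyEntries))
        {
            string word = raw.ToLowerInvariant();
            if (!seen.ContainsKey(word))
            {
                seen[word] = true;
                words.Add(word);  // one (Word, ItemId) row per distinct word
            }
        }
        return words;
    }
}
```

Each returned word would be inserted into WordItemOccurance alongside the item's ID, so a search becomes an indexed equality join instead of a LIKE scan.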
I voted for Mitch Wheat's answer, but there are a few tricks I would also test for effectiveness.
My big worry about having a table full of [char],[int] is that you may still find yourself executing large volumes of pointless string comparisons, especially if you use %word% on this new table (duplicated entries that don't match our search).
I would probably opt for experimenting with
Words
-----
chars | word_id
WordsToEntry
------------
word_id | entry_id
and see if the database overhead is a worthy mitigation of this possible problem (I can't test, sorry).
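Spelled out as DDL, the normalized layout sketched above might look like this (a sketch only; column names follow the diagram, and the foreign keys and identity column are my assumptions):

```sql
-- Each distinct word stored exactly once.
CREATE TABLE Words
(
    word_id int identity primary key,
    chars   varchar(50) not null
);

-- Many-to-many link between words and items.
CREATE TABLE WordsToEntry
(
    word_id  int not null constraint FK_Word  references Words(word_id),
    entry_id int not null constraint FK_Entry references ItemTable(ID)
);
```

A search then resolves the word to a word_id once and joins through WordsToEntry, so duplicate word strings are never compared repeatedly.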