Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fastest way to search in a string collection

Problem:

I have a text file of around 120,000 users (strings) which I would like to store in a collection and later to perform a search on that collection.

The search method will occur every time the user change the text of a TextBox and the result should be the strings that contain the text in TextBox.

I don't have to change the list, just pull the results and put them in a ListBox.

What I've tried so far:

I tried with two different collections/containers, which I'm dumping the string entries from an external text file (once, of course):

  1. List<string> allUsers;
  2. HashSet<string> allUsers;

With the following LINQ query:

allUsers.Where(item => item.Contains(textBox_search.Text)).ToList();

My search event (fires when user change the search text):

private void textBox_search_TextChanged(object sender, EventArgs e) {     if (textBox_search.Text.Length > 2)     {         listBox_choices.DataSource = allUsers.Where(item => item.Contains(textBox_search.Text)).ToList();     }     else     {         listBox_choices.DataSource = null;     } } 

Results:

Both gave me a poor response time (around 1-3 seconds between each key press).

Question:

Where do you think my bottleneck is? The collection I've used? The search method? Both?

How can I get better performance and more fluent functionality?

like image 716
etaiso Avatar asked Jan 27 '14 11:01

etaiso


People also ask

How do I find a string in collections?

Methods: Simple loop through HashSet and call endsWith() on each string. Simple loop through HashSet and perform precompiled Pattern match (regex) on each string. Convert HashSet to single String joined by \n and single match on whole String.

Which collection is faster in C#?

ListDictionary is faster than Hashtable for small collections (10 items or fewer). The Dictionary<TKey,TValue> generic class provides faster lookup than the SortedDictionary<TKey,TValue> generic class.


2 Answers

You could consider doing the filtering task on a background thread which would invoke a callback method when it's done, or simply restart filtering if input is changed.

The general idea is to be able to use it like this:

public partial class YourForm : Form {     private readonly BackgroundWordFilter _filter;      public YourForm()     {         InitializeComponent();          // setup the background worker to return no more than 10 items,         // and to set ListBox.DataSource when results are ready          _filter = new BackgroundWordFilter         (             items: GetDictionaryItems(),             maxItemsToMatch: 10,             callback: results =>                this.Invoke(new Action(() => listBox_choices.DataSource = results))         );     }      private void textBox_search_TextChanged(object sender, EventArgs e)     {         // this will update the background worker's "current entry"         _filter.SetCurrentEntry(textBox_search.Text);     } } 

A rough sketch would be something like:

public class BackgroundWordFilter : IDisposable {     private readonly List<string> _items;     private readonly AutoResetEvent _signal = new AutoResetEvent(false);     private readonly Thread _workerThread;     private readonly int _maxItemsToMatch;     private readonly Action<List<string>> _callback;      private volatile bool _shouldRun = true;     private volatile string _currentEntry = null;      public BackgroundWordFilter(         List<string> items,         int maxItemsToMatch,         Action<List<string>> callback)     {         _items = items;         _callback = callback;         _maxItemsToMatch = maxItemsToMatch;          // start the long-lived backgroud thread         _workerThread = new Thread(WorkerLoop)         {             IsBackground = true,             Priority = ThreadPriority.BelowNormal         };          _workerThread.Start();     }      public void SetCurrentEntry(string currentEntry)     {         // set the current entry and signal the worker thread         _currentEntry = currentEntry;         _signal.Set();     }      void WorkerLoop()     {         while (_shouldRun)         {             // wait here until there is a new entry             _signal.WaitOne();             if (!_shouldRun)                 return;              var entry = _currentEntry;             var results = new List<string>();              // if there is nothing to process,             // return an empty list             if (string.IsNullOrEmpty(entry))             {                 _callback(results);                 continue;             }              // do the search in a for-loop to              // allow early termination when current entry             // is changed on a different thread             foreach (var i in _items)             {                 // if matched, add to the list of results                 if (i.Contains(entry))                     results.Add(i);                  // check if the current entry was updated in the meantime,                 // or we found enough items                 if (entry != _currentEntry || results.Count >= _maxItemsToMatch)                     break;             }              if (entry == _currentEntry)                 _callback(results);         }     }      public void Dispose()     {         // we are using AutoResetEvent and a background thread         // and therefore must dispose it explicitly         Dispose(true);     }      private void Dispose(bool disposing)     {         if (!disposing)             return;          // shutdown the thread         if (_workerThread.IsAlive)         {             _shouldRun = false;             _currentEntry = null;             _signal.Set();             _workerThread.Join();         }          // if targetting .NET 3.5 or older, we have to         // use the explicit IDisposable implementation         (_signal as IDisposable).Dispose();     } } 

Also, you should actually dispose the _filter instance when the parent Form is disposed. This means you should open and edit your Form's Dispose method (inside the YourForm.Designer.cs file) to look something like:

// inside "xxxxxx.Designer.cs" protected override void Dispose(bool disposing) {     if (disposing)     {         if (_filter != null)             _filter.Dispose();          // this part is added by Visual Studio designer         if (components != null)             components.Dispose();     }      base.Dispose(disposing); } 

On my machine, it works pretty quickly, so you should test and profile this before going for a more complex solution.

That being said, a "more complex solution" would possibly be to store the last couple of results in a dictionary, and then only filter them if it turns out that the new entry differs by only the first of last character.

like image 77
Groo Avatar answered Oct 06 '22 12:10

Groo


I've done some testing, and searching a list of 120,000 items and populating a new list with the entries takes a negligible amount of time (about a 1/50th of a second even if all strings are matched).

The problem you're seeing must therefore be coming from the populating of the data source, here:

listBox_choices.DataSource = ... 

I suspect you are simply putting too many items into the listbox.

Perhaps you should try limiting it to the first 20 entries, like so:

listBox_choices.DataSource = allUsers.Where(item => item.Contains(textBox_search.Text))     .Take(20).ToList(); 

Also note (as others have pointed out) that you are accessing the TextBox.Text property for each item in allUsers. This can easily be fixed as follows:

string target = textBox_search.Text; listBox_choices.DataSource = allUsers.Where(item => item.Contains(target))     .Take(20).ToList(); 

However, I timed how long it takes to access TextBox.Text 500,000 times and it only took 0.7 seconds, far less than the 1 - 3 seconds mentioned in the OP. Still, this is a worthwhile optimisation.

like image 23
Matthew Watson Avatar answered Oct 06 '22 12:10

Matthew Watson