Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Which C# data structure allows searching a pair of strings most efficiently for substrings?

I have a data structure which consists of pairs of values, the first of which is an integer and the second of which is an alphanumeric string (which may begin with digits):

+--------+-----------------+
| Number | Name            |
+--------+-----------------+
| 15     | APPLES          |
| 16     | APPLE COMPUTER  |
| 17     | ORANGE          |
| 21     | TWENTY-1        |
| 291    | 156TH ELEMENT   |
+--------+-----------------+

A table of these would comprise up to 100,000 rows.

I'd like to provide a lookup function in which the user can look up either the number (as if it were a string), or pieces of the string. Ideally the lookup will be "live" as the user types; after each keystroke (or maybe after a brief delay ~250-500 ms) a new search will be done to find the most likely candidates. So, for example searching on

  • 1 will return 15 APPLES, 16 APPLE COMPUTER, 17 ORANGE, and 291 156TH ELEMENT
  • 15 will narrow the search to 15 APPLES, 291 156TH ELEMENT
  • AP will return 15 APPLES and 16 APPLE COMPUTER
  • (ideally, but not required) ELEM will return 291 156TH ELEMENT.

I was thinking about using two Dictionary<string, string>s since ultimately the ints are being compared as strings -- one will index by the integer part and the other by the string part.

But really searching by substring shouldn't use a hash function, and it seems wasteful to use twice the memory that I feel like I should need.

Ultimately the question is, is there any well-performing way to text search two large lists simultaneously for substrings?

Failing that, how about a SortedDictionary? Might increase performance but still wouldn't solve the hash problem.

Thought about creating a regex on the fly, but I would think that would perform terribly.

I'm new to C# (having come from the Java world) so I haven't looked into LINQ yet; is that the answer?

EDIT 18:21 EST: None of the strings in the "Name" field will be longer than 12-15 characters, if that affects your potential solution.

like image 672
James Cronen Avatar asked Jan 24 '12 22:01

James Cronen


People also ask

Which version of C is best?

Yes, it's a bit odd that you can get a loud consensus that K&R is a great C book, and also a loud consensus that C99 is the correct/current/best version of C.

Is C still used in 2022?

C is one of the earliest and most widely used programming languages. C is the fourth most popular programming language in the world as of January 2022. Modern languages such as Go, Swift, Scala, and Python are not as popular as C. Where is C used today?

What C is used for?

C programming language is a machine-independent programming language that is mainly used to create many types of applications and operating systems such as Windows, and other complicated programs such as the Oracle database, Git, Python interpreter, and games and is considered a programming foundation in the process of ...


1 Answers

If possible, I would avoid loading all 100,000 entries into memory. I would use either a database or Lucene.Net to index the values. Then use the appropriate query syntax to efficiently search for the results.

like image 168
Phil Bolduc Avatar answered Sep 20 '22 11:09

Phil Bolduc