 

Fastest way to search a huge list of big texts

I have a Windows application written in C# that needs to load 250,000 rows from a database and provide a "search as you type" feature: as soon as the user types something in a text box, the application must search all 250,000 records (each, by the way, a single column of 1000 characters) using a LIKE-style search and display the matching records.

The approach I followed was:

1- The application loads all the records into a typed List&lt;EmployeesData&gt;

while (objSQLReader.Read())
{
    lstEmployees.Add(new EmployeesData(
        Convert.ToInt32(objSQLReader.GetString(0)),
        objSQLReader.GetString(1), 
        objSQLReader.GetString(2)));
}

2- In the TextChanged event, using LINQ (combined with a regular expression), I search the list and attach the resulting IEnumerable&lt;EmployeesData&gt; to a ListView which is in virtual mode.

String strPattern = "(?=.*wood)(?=.*james)";
IEnumerable<EmployeesData> lstFoundItems = from objEmployee in lstEmployees
                                           where Regex.IsMatch(objEmployee.SearchStr, strPattern, RegexOptions.IgnoreCase)
                                           select objEmployee;
lstFoundEmployees = lstFoundItems;
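One thing worth noting about the snippet above: the static Regex.IsMatch call re-resolves the pattern for every one of the 250,000 rows. Building a single Regex per keystroke, with RegexOptions.Compiled, avoids that repeated work. A hedged sketch (BuildPattern is an illustrative helper, not part of the original code):

```csharp
using System;
using System.Text.RegularExpressions;

class CompiledSearch
{
    // Build the pattern once per keystroke, not once per row.
    // RegexOptions.Compiled emits IL for the pattern, which pays off
    // when the same regex is applied to thousands of strings.
    public static Regex BuildPattern(params string[] terms)
    {
        string pattern = string.Concat(
            Array.ConvertAll(terms, t => "(?=.*" + Regex.Escape(t) + ")"));
        return new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Compiled);
    }
}
```

The LINQ query would then call rx.IsMatch(objEmployee.SearchStr) against the prebuilt instance instead of the static Regex.IsMatch overload.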

3- The RetrieveVirtualItem event is handled to supply each item the ListView displays.

e.Item = new ListViewItem(new String[] { 
    lstFoundEmployees.ElementAt(e.ItemIndex).DateProjectTaskClient, 
    e.ItemIndex.ToString() });

Though lstEmployees loads relatively fast (1.5 seconds to load the list from SQL Server), the LINQ search on TextChanged takes more than 7 minutes. Searching through SQL Server directly by performing a LIKE query takes less than 7 seconds.

What am I doing wrong here? How can I make this search faster (no more than 2 seconds)? This is a requirement from my client, so any help is highly appreciated. Please help...

asked Jan 06 '12 by user1130862


3 Answers

Does the database column that stores the text data have an index on it? If so, something similar to the trie structure that Nicholas described is already in use. Indexes in SQL Server are implemented using B+ trees, which have an average search time on the order of log base 2 of n, where n is the number of records. This means that if you have 250,000 records in the table, the number of operations required to search is log base 2 (250,000), or approximately 18 operations.
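To make that arithmetic concrete, the bound can be computed directly (a real B+ tree has a much higher fan-out than 2, so its actual depth is even smaller than this binary estimate):

```csharp
using System;

class BTreeDepth
{
    // Rough number of comparisons for a balanced binary search over n rows.
    public static int Comparisons(int n) => (int)Math.Ceiling(Math.Log(n, 2));
}
```

For the question's table, BTreeDepth.Comparisons(250000) gives 18.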

When you load all of the information into a data reader and then use a LINQ expression, it's a linear operation, O(n), where n is the length of the list. So worst case, it's going to be 250,000 operations. If you use a DataView, there will be indexes that can be used to help with searching, which will drastically improve performance.
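A sketch of the DataView idea, assuming the text lives in a column named SearchStr as in the question. One caveat: FindRows matches whole key values against the sorted index; for substring or prefix matching you would fall back to RowFilter with a LIKE expression, which may not use the index the same way:

```csharp
using System;
using System.Data;

class DataViewLookup
{
    // A sorted DataView maintains an internal index on the Sort column,
    // so FindRows is an index lookup rather than a full scan of the table.
    public static DataRowView[] FindBySearchStr(DataTable table, string key)
    {
        var view = new DataView(table) { Sort = "SearchStr" };
        return view.FindRows(key);
    }
}
```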

At the end of the day, if there will not be too many requests submitted against the database server, leverage the query optimizer to do this. As long as the LIKE operation isn't performed with a wildcard at the front of the string (i.e. LIKE '%some_string'), which negates the use of an index, and there is an index on the table, you will have really fast performance. If there are just too many requests that will be submitted to the database server, either put all of the information into a DataView so an index can be used, or use a dictionary as Tim suggested, which has a search time of O(1) (on the order of one), assuming the dictionary is implemented using a hash table.
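A minimal sketch of the dictionary approach, assuming each record is tokenized into whitespace-separated words up front (an inverted index; the class and method names here are illustrative, not from the original post):

```csharp
using System;
using System.Collections.Generic;

class InvertedIndex
{
    private readonly Dictionary<string, HashSet<int>> _index =
        new Dictionary<string, HashSet<int>>(StringComparer.OrdinalIgnoreCase);

    // Index each word of a record once, up front, when the data is loaded.
    public void Add(int recordId, string text)
    {
        foreach (string word in text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        {
            if (!_index.TryGetValue(word, out var ids))
                _index[word] = ids = new HashSet<int>();
            ids.Add(recordId);
        }
    }

    // One hash lookup per query term; intersecting the sets gives
    // AND semantics, like the "(?=.*wood)(?=.*james)" regex.
    public HashSet<int> Find(params string[] terms)
    {
        HashSet<int> result = null;
        foreach (string term in terms)
        {
            if (!_index.TryGetValue(term, out var ids)) return new HashSet<int>();
            if (result == null) result = new HashSet<int>(ids);
            else result.IntersectWith(ids);
        }
        return result ?? new HashSet<int>();
    }
}
```

The trade-off is that this only matches whole words, not arbitrary substrings, which is often acceptable for a "search as you type" box.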

answered Nov 12 '22 by Christopher W. Szabo



You'd be wanting to preload things and build yourself a data structure called a trie.

a trie, what else?

It's memory-intensive, but it's what the doctor ordered in this case.
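A minimal sketch of the idea, assuming prefix search on lowercased words is what's wanted (names are illustrative). Storing the record ids at every node along the path is exactly what makes this memory-intensive, and also what makes lookups independent of the number of records:

```csharp
using System.Collections.Generic;

class TrieNode
{
    public Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public List<int> RecordIds = new List<int>(); // records whose text starts with this prefix
}

class Trie
{
    private readonly TrieNode _root = new TrieNode();

    // Insert a word; every node along the path remembers the record id,
    // so a prefix query is a single walk down the tree.
    public void Add(string word, int recordId)
    {
        var node = _root;
        foreach (char c in word.ToLowerInvariant())
        {
            if (!node.Children.TryGetValue(c, out var next))
                node.Children[c] = next = new TrieNode();
            node = next;
            node.RecordIds.Add(recordId);
        }
    }

    // Cost depends only on the length of the prefix typed so far,
    // not on how many records are stored.
    public List<int> StartsWith(string prefix)
    {
        var node = _root;
        foreach (char c in prefix.ToLowerInvariant())
            if (!node.Children.TryGetValue(c, out node))
                return new List<int>();
        return node.RecordIds;
    }
}
```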

answered Nov 12 '22 by Nicholas Carey



See my answer to this question. If you need instant response (i.e. as fast as a user types), loading the data into memory can be a very attractive option. It may use a bit of memory, but it is very fast.

Even though there are many characters (250K records * 1000), how many unique values are there? An in-memory structure based on keys, with pointers to the records matching those keys, really doesn't have to be that big, even accounting for permutations of those keys.

If the data truly won't fit into memory or changes frequently, keep it in the database and use SQL Server Full Text Indexing, which will handle searches such as this much better than a LIKE. This assumes a fast connection from the application to the database.

Full Text Indexing offers a powerful set of operators/expressions which can be used to make searches more intelligent. It's available with the free SQL Server Express Edition, which will handle up to 10GB of data.
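A hedged sketch of what the query side could look like, assuming a full-text index has already been created on the SearchStr column of the EmployeesData table (those names come from the question; the helper itself is illustrative). CONTAINS with AND reproduces the "all terms must appear" behaviour of the original regex:

```csharp
using System;

class FullTextQuery
{
    // Builds a SQL Server CONTAINS query string; assumes a full-text
    // index exists on EmployeesData(SearchStr). Terms are AND-ed.
    public static string Build(params string[] terms)
    {
        string predicate = "\"" + string.Join("\" AND \"", terms) + "\"";
        return "SELECT * FROM EmployeesData WHERE CONTAINS(SearchStr, '"
            + predicate.Replace("'", "''") + "')";
    }
}
```

The resulting string would be run through an ordinary SqlCommand; parameterizing the predicate (rather than concatenating it) would be the safer choice in production code.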

answered Nov 12 '22 by Tim M.