
Iterating over large collections in C#: Taking very long

I recently started a WPF application. I connected that to a BaseX (XML-based) database and retrieved about one million entries from it. I wanted to iterate over the entries, calculate something for each entry and then write that back to the database:

IEnumerable<Result> resultSet = baseXClient.Query("...", "database");
foreach (Result result in resultSet) 
{
    ...
}

The problem: the inside of the foreach is never reached. The Query() method returns quickly, but when execution reaches the foreach, C# seems to do something with the collection, and the code does not continue for a very long time (at least 10 minutes; I never let it run any longer). What's going on here? I tried limiting the number of items retrieved. When retrieving 100,000 results, the same thing occurs, but the code continues after about 10-20 seconds. When retrieving the full one million results, C# seems to be stuck forever...

Any ideas? regards

Edit: Why this is happening
As some of you pointed out, the reason for this behavior seems to be that the query is only evaluated when MoveNext() is called on the enumerator obtained from the IEnumerable. My database seems unable to return one value at a time and instead returns the entire set of one million entries at once. I will try to switch to another database (Apache Lucene, if possible, as it has good full-text search support) and edit this post to let you know whether it changed anything.
PS: Yes, I am aware that one million results is a lot. This is not meant for live usage, it is just a step for preparing the data. While I didn't expect the code to run in a few seconds, I was still surprised to see SUCH poor performance in the database.
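
To illustrate the deferred evaluation described above, here is a minimal sketch. LazyQuery is a hypothetical stand-in written as a C# iterator, not the BaseX client's Query() method, and the trace list exists only to record when the "query" actually runs. The sketch uses top-level statements, so it assumes C# 9 or later.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a lazily evaluated query; this is NOT the BaseX client API.
// The body of an iterator method does not run until MoveNext() is first called.
static IEnumerable<int> LazyQuery(int count, List<string> trace)
{
    trace.Add("query executed");
    for (int i = 0; i < count; i++)
    {
        yield return i;
    }
}

var log = new List<string>();

// Returns immediately: only an enumerable object is created, no work is done yet.
IEnumerable<int> resultSet = LazyQuery(3, log);
int entriesBeforeLoop = log.Count;

int sum = 0;
foreach (int result in resultSet)
{
    // foreach calls GetEnumerator() and then MoveNext(); the first MoveNext() runs the query.
    sum += result;
}

Console.WriteLine($"Log entries before the loop: {entriesBeforeLoop}"); // 0
Console.WriteLine($"Log entries after the loop: {log.Count}");          // 1
Console.WriteLine($"Sum of results: {sum}");                            // 3
```

If the real client behaves similarly, that would explain why Query() returns fast and the wait only begins at the foreach.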

Edit: The Solution
So I migrated the XML database to Apache Lucene. Works like a charm! Of course, Lucene is a text-based database that is not suitable for every use case, but for me it worked wonders. I can iterate over one million entries in a few seconds, with one entry fetched per loop iteration. It works extremely well!

BlackWolf asked Jul 02 '12 18:07


1 Answer

Let me guess: you are NOT loading the data when you create the resultSet, but when it is first accessed (delayed execution), and with one million entries it simply takes a long time to deserialize them all into memory.

Welcome to the inefficiencies of XML databases.
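
A rough sketch of the difference between a source that loads everything on first access and one that streams entries one at a time. Both sources are hypothetical and only simulate "deserializing" by incrementing a counter; neither reflects the actual BaseX or Lucene API. The counter is passed as a one-element array because iterator methods cannot take ref parameters. As written, it assumes C# 9 or later for top-level statements.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sources for illustration only; neither is the BaseX or Lucene API.
// loaded[0] counts how many entries have been "deserialized" so far.

// Buffering source: the first MoveNext() loads every entry before yielding any.
static IEnumerable<int> BufferedSource(int count, int[] loaded)
{
    var buffer = new List<int>(count);
    for (int i = 0; i < count; i++)
    {
        loaded[0]++;
        buffer.Add(i);
    }
    foreach (int item in buffer)
    {
        yield return item;
    }
}

// Streaming source: each MoveNext() loads exactly one entry.
static IEnumerable<int> StreamingSource(int count, int[] loaded)
{
    for (int i = 0; i < count; i++)
    {
        loaded[0]++;
        yield return i;
    }
}

// Reports how many entries were loaded by the time the loop body first runs.
static int LoadedAtFirstIteration(IEnumerable<int> source, int[] loaded)
{
    foreach (int unused in source)
    {
        return loaded[0];
    }
    return loaded[0];
}

var bufferedCounter = new int[1];
var streamingCounter = new int[1];
int bufferedFirst = LoadedAtFirstIteration(BufferedSource(1000, bufferedCounter), bufferedCounter);
int streamingFirst = LoadedAtFirstIteration(StreamingSource(1000, streamingCounter), streamingCounter);

Console.WriteLine($"Buffered: {bufferedFirst} entries loaded before the first iteration");  // 1000
Console.WriteLine($"Streaming: {streamingFirst} entry loaded before the first iteration");  // 1
```

With a buffering source, the cost of the whole result set is paid before the loop body runs even once, which would match the long pause described in the question.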

TomTom answered Nov 15 '22 04:11