Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Data Partition in C#

Tags:

c#

.net

sql

mysql

I want to distribute a large amount of data to different C# applications. For example, my table contains millions of records. I would like to specify that first 3 millions records are processed by App1 and next 3 million in another C# application App2 and so on. Table rows are deleted and added as per requirement. Now I want to write a SQL query that will process first 3 million records. Now if 5 records are deleted from app1 then app1 must fetch next 5 records from app2 and app2 from app3. So that data always remain constant in each app.

I have used limit in the SQL query, but I didn't get the required output. How can I write the SQL query for this and how should I design the C# application.

like image 447
A.P.S Avatar asked Jul 26 '10 07:07

A.P.S


2 Answers

It looks a bit as if you want to take over from the database and do the processing that a database is tasked and tailored to do, in your own application. You talk of an SQL query with a LIMIT statement. Don't use that. Millions of records is not much in database terms. If you have performance issues, you may need to index your table or revisit the query design (watch its execution path for performance issues).

If you really cannot let the database do the task and you need to process them one by one in your application, the network latency and bandwidth is likely to be an earlier candidate for performance issues, which you won't make any faster by using multiple apps (let alone the cost of such queries).

If my observations are wrong and your processing of the records must take place outside the database and the network is not a bottleneck, nor are the processors or the database machine and multiple applications will provide a performance gain, then I suggest you create a dispatch application that processes the records and makes them available to your other applications (or better: threads) through normal POCOs. This creates a much easier way of spreading the processing and the dispatch application (or thread) can work as some kind of funnel for your processing applications.

However, look at the cost / benefit equation: is the trouble really going to gain you some performance, or is it better to revisit your design and find a more practical solution?

like image 97
Abel Avatar answered Sep 21 '22 02:09

Abel


That sounds like a really bad idea. Requesting a limit of 3 million records is a very slow operation.

An alternative approach would be to have an instance number column and have each instance of your application reserve rows as it needs them by writing its instance number into this column. Process your data in smaller chunks if possible.

Adding an index to the instance number column will allow you to count how many rows you have already handled and also to find the next batch of 1000 (for example) that haven't been assigned to any instance yet.

like image 38
Mark Byers Avatar answered Sep 21 '22 02:09

Mark Byers