I have to write a program that compares 10'000'000+ entities against one another. The entities are basically flat rows in a database or CSV file. The comparison algorithm has to be quite flexible: it is based on a rule engine where the end user enters rules, and each entity is matched against every other entity. I've been thinking about how to split this task into smaller workloads, but I haven't found a way yet. Since the rules are entered by the end user, pre-sorting the data set seems impossible. What I'm trying now is to fit the entire data set in memory and process each item, but that is not very efficient and requires approx. 20 GB of memory (compressed). Do you have an idea how I could split the workload or reduce its size? Thanks


Compare 10 Million Entities

2 Answers

If your rules are at the highest level of abstraction (e.g. an arbitrary, unknown comparison function), you can't achieve your goal: comparing 10^7 entities with one another takes on the order of 10^14 comparison operations, which will run for ages.
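
To put a rough number on "ages", here is a small back-of-the-envelope sketch. The throughput of one million rule evaluations per second is purely an assumption for illustration, not a measurement. Counting each unordered pair only once halves the 10^14 figure to about 5 * 10^13, which does not change the conclusion.

```python
# Back-of-the-envelope cost of comparing every entity with every other one.
# ASSUMPTION: 1,000,000 rule evaluations per second is a made-up throughput,
# not a measurement; substitute your own figure.

def unordered_pairs(n: int) -> int:
    """Number of distinct pairs {a, b} with a != b among n entities."""
    return n * (n - 1) // 2


def estimated_days(n: int, comparisons_per_second: float) -> float:
    """Wall-clock days needed to evaluate every unordered pair once."""
    return unordered_pairs(n) / comparisons_per_second / 86_400


n = 10_000_000
pairs = unordered_pairs(n)           # 49_999_995_000_000, about 5 * 10**13
days = estimated_days(n, 1_000_000)  # roughly 579 days at the assumed rate
```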

If the rules are not completely general, I see three ways to optimize different cases:

- If the comparison is transitive and you can calculate a hash (somebody already recommended this), do it. Hashes can be complicated too, not only your rules =). Find a good hash function and it may help in many cases.
- If the entities are sortable, sort them. For this I'd recommend not sorting in place but building an array of indexes (or IDs) of the items. If your comparison can be translated to SQL (as I understand it, your data is in a database), you can do this more efficiently on the DBMS side and read back the sorted indexes (for example 3, 1, 2, which means that the item with ID=3 is the lowest, the one with ID=1 is in the middle and the one with ID=2 is the largest). Then you only need to compare adjacent elements.
- If it is worth the effort, I would try some heuristic sorting or hashing. By that I mean a hash that does not necessarily identify equal elements uniquely, but that splits your data set into groups such that no pair of equal elements spans two groups. All equal pairs then lie inside the groups, so you can read the groups one by one and run the complex comparison function within a group of, for example, 100 elements rather than 10 000 000. The other sub-approach is heuristic sorting with the same purpose: to guarantee that equal elements do not end up at opposite ends of the data set. After that you can read the elements one by one and compare each with, for example, the 1000 previous elements (already read and kept in memory). I would keep, for example, 1100 elements in memory and free the oldest 100 every time 100 new ones arrive. This would optimize your DB reads. Another variant of this is possible if your rules contain simple conditions such as (Attribute1=Value1) AND (...) or (Attribute1 < Value2) AND (...). Then you can first cluster the items by these criteria and then compare the items within each cluster.
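
The sort-then-compare-neighbours idea could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: `window_pairs`, `sort_key` and `is_match` are hypothetical names, the sort key stands in for whatever heuristic your rules allow, and a bounded `deque` replaces the "keep 1100, free the oldest 100" batching (reading from the database in batches is not modelled here).

```python
from collections import deque
from typing import Callable, Iterator, Sequence, Tuple, TypeVar

T = TypeVar("T")


def window_pairs(
    items: Sequence[T],
    sort_key: Callable[[T], object],
    is_match: Callable[[T, T], bool],
    window: int = 1000,
) -> Iterator[Tuple[int, int]]:
    """Yield index pairs of matching items that lie within `window`
    positions of each other in sort order.

    ASSUMPTION: `sort_key` is a heuristic that places matching items close
    together; matches further apart than `window` are missed.
    """
    # Sort the indexes rather than the items themselves.
    order = sorted(range(len(items)), key=lambda i: sort_key(items[i]))
    recent: deque = deque(maxlen=window)  # drops the oldest index automatically
    for idx in order:
        for prev in recent:
            if is_match(items[prev], items[idx]):
                yield (prev, idx)
        recent.append(idx)


# Hypothetical data: names that differ only by surrounding whitespace match.
names = ["smith", "smyth", "jones", "smith ", "jonas"]
matches = list(
    window_pairs(
        names,
        sort_key=str.strip,
        is_match=lambda a, b: a.strip() == b.strip(),
        window=2,
    )
)
# matches == [(0, 3)]
```

With a window of size w this performs about n * w comparisons instead of n^2, at the price of missing any match that the sort key does not bring within w positions.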

By the way, what if your rule considers all 10 000 000 elements equal? Would you want to get 10^14 result pairs? This case shows that the task cannot be solved in the general case. Try to introduce some limitations and assumptions.

answered Nov 12 '22 17:11

Sasha

I would think about a rule hierarchy. Say, for example, that rule A is "Color" and rule B is "Shape".

If you first divide the objects by color, then there is no need to compare a red circle with a blue triangle.

This will reduce the number of comparisons you need to do.
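
A minimal sketch of that idea, assuming the user's rule requires equal color before anything else is checked (otherwise pairs across colors could still match). The name `pairs_within_groups` and the `(color, shape)` tuples are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations
from typing import Callable, Dict, Hashable, Iterable, Iterator, List, Tuple, TypeVar

T = TypeVar("T")


def pairs_within_groups(
    items: Iterable[T],
    group_key: Callable[[T], Hashable],
) -> Iterator[Tuple[T, T]]:
    """Yield only the pairs whose members share the same group key."""
    groups: Dict[Hashable, List[T]] = defaultdict(list)
    for item in items:
        groups[group_key(item)].append(item)
    for members in groups.values():
        # Pairs are formed inside a group only, never across groups.
        yield from combinations(members, 2)


# Hypothetical entities as (color, shape) tuples.
shapes = [
    ("red", "circle"),
    ("blue", "triangle"),
    ("red", "square"),
    ("blue", "circle"),
    ("red", "circle"),
]
candidates = list(pairs_within_groups(shapes, group_key=lambda s: s[0]))
# 4 candidate pairs (3 red + 1 blue) instead of the 10 pairs of a full comparison
```

The remaining rules (here, "Shape") then only have to be evaluated on the candidate pairs.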

answered Nov 12 '22 19:11

omer schleifer


Tags:

c#

algorithm

matching

senic
