Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

MySQL architecture for n * (n - 1) / 2 algorithm

I'm currently developing a website where users can search for other users based on attributes (age, height, town, education, etc.). I now want to implement some kind of rating between user profiles. The rating is calculated via its own algorithm based on similiarity between the 2 given profiles. User A has a rating "match rating" of 85 with User B and 79 with User C for example. B and C have a rating of 94 and so on....

The user should be able to search for certain attributes and filter the results by rating.

Since the rating differs from profile to profile and also depends on the user doing the search, I can't simply add a field to my users table and use ORDER BY. So far I came up with 2 solutions:

  • My first solution was to have a nightly batch job, that calculates the rating for every possible user combination and stores it in a separate table (user1, user2, rating). I then can join this table with the user table and order the result by rating. After doing some math I figured that this solution doesn't scale that well.

    Based on the formula n * (n - 1) / 2 there are 45 possible combination for 10 users. For 1.000 users I suddenly have to insert 499.500 rating combinations into my rating table.

  • The second solution was to leave MySQL be and just calculate the rating on the fly within my application. This also doesn't scale well. Let's say the search should only return 100 results to the UI (with the highest rated on top). If I have 10.000 users and I want to do a search for every user living in New York sorted by rating, I have to load EVERY user that is living in NY into my app (let's say 3.000), apply the algorithm and then return only the top 100 to the user. This way I have loaded 2.900 useless user objects from the DB and wasted CPU on the algorithm without ever doing anything with it.

Any ideas how I can design this in my MySQL db or web app so that a user can have an individual rating with every other user in a way that the system scales beyond a couple thousand users?

like image 346
black666 Avatar asked Oct 01 '12 19:10

black666


People also ask

What is the architecture of MySQL?

The MySQL pluggable storage engine architecture enables a database professional to select a specialized storage engine for a particular application need while being completely shielded from the need to manage any specific application coding requirements.

What type of architecture does MySQL DBMS use?

MYSQL follow Client-Server Architecture. It is designed so that end user that is Clients can access the resources from Computer that is server using various networking services.

What is InnoDB architecture?

InnoDB is a multi-versioned storage engine: it keeps information about old versions of changed rows, to support transactional features such as concurrency and rollback. This information is stored in the tablespace in a data structure called a rollback segment (after an analogous data structure in Oracle).

How MySQL works internally?

MySQL parses queries to create an internal structure (the parse tree), and then applies a variety of optimizations. These can include rewriting the query, determining the order in which it will read tables, choosing which indexes to use, and so on.


2 Answers

If you have to match every user against every other user, the algorithm is O(N^2), whatever you do.

If you can exploit some sort of 1-dimensional "metric", then you can try and associate each user with a single synthetic value. But that's awkward and could be impossible.

But what you can do is to note which users require a change in their profiles (whenever any of the parameters on which the matching is based, changes). At that point you can batch-recalculate the table for those users only, thus working in O(N): if you have 10000 users and only 10 require recalculation, you have to examine 100,000 records instead of 100,000,000.

Other strategies would be to only run the main algorithm for records which have the greater chance of being compared: in your example, "same city". Or when updating records (but this would require to store (user_1, user_2, ranking, last_calculated), only recalculate those records with high ranking, very old, or never calculated. Lowest ranked matches aren't likely to change so much that they float to the top in a short time.

UPDATE

The problem is also operating with O(N^2) storage space.

How to reduce this space? I think I can see two approaches. One is to not put some information in the match table at all. The "match" function makes the more sense the more it is rigid and steep; having ten thousand "good matches" would mean that matching means very little. So we would still need lotsa recalculations when User1 changes some key data, in case it brings some of User1's "no-no" matches back into the "maybe" zone. But we would keep a smaller clique of active matches for each user.

Storage would still grow quadratically, but less steeply.

Another strategy would be to recalculate the match, and then we would need to develop some method for quickly selecting which users are likely to have a good match (thus limiting the number of rows retrieved by the JOIN), and some method to quickly calculate a match; which could entail somehow rewriting the match between User1 and User2 to a very simple function of a subset of DataUser1, DataUser2 (maybe using ancillary columns).

The challenge would be to leverage MySQL capabilities and offload some calculations the the MySQL engine.

To this purpose you might perhaps "map" some data, at input time (therefore in O(k)), to spatial information, or to strings and employ Levenshtein distance.

The storage for a single user would grow, but it would grow linearly, not quadratically, and MySQL SPATIAL indexes are very efficient.

like image 87
LSerni Avatar answered Oct 20 '22 19:10

LSerni


If the search should only return the top 100 best matches, then why not just store those? It sounds like you would never want to search the bottom end of the results anyway, so just don't calculate them.

That way, your storage space is only o(n), rather than o(n^2), and updates should be, as well. If someone really wants to see matches past the first 100 (and you want to let them) then you have the option of running the query in real time at that point.

like image 2
Ian Clelland Avatar answered Oct 20 '22 18:10

Ian Clelland