How to improve performance of matching algorithm

I am writing an algorithm that matches users on the basis of interests and location. Suppose I have this data for users:

{
    "users": [{
            "location": "Delhi, India",
            "interests": ["Jogging", "Travelling", "Praying"],
            "groups": ["exercise", "travelling", "Praying"]
        },
        {
            "location": "Delhi, India",
            "interests": ["Running", "Eating", "Praying"],
            "groups": ["exercise", "Eating", "Praying"]
        }, {
            "location": "Delhi, India",
            "interests": ["Shopping"],
            "groups": ["Shopping"]
        }
    ]
}

Here user1 and user2 share the groups "exercise" and "Praying", while user1 and user3 have no groups in common.

Finding people with similar interests in a database of 10+ million users can hurt database performance if I run a SQL query with a WHERE clause every time a request arrives from the mobile app:

SELECT * FROM users WHERE groups = 'exercise' OR groups = 'travelling' OR groups = 'Praying';

This will scan every profile, which will hurt my application's performance. I do not want to use this approach, as it will not scale. What algorithm should I use to get high performance?

N Sharma asked Apr 24 '17


2 Answers

You can construct an inverted index where the key is one of the tokens in 'groups' (i.e. exercise, travelling, etc.) and the value is the list of users that fall under that group. For example, your inverted index would look something like this:

Key: ListOfValues
Exercise: User1 -> User2
Praying: User1 -> User2
Travelling: User1 -> User3 -> User8 -> User14
Shopping: User3

Whether you want a tree-based, bitmap-based, or hash-table-based inverted index is your choice, depending on your space/time trade-offs.

Now when you get a new user, say User99 with groups (Exercise and Praying), you can quickly retrieve the values (i.e. the users) for the 'Exercise' token, then retrieve the values for the 'Praying' token, and finally take the intersection ('AND') of the two.

Note that building the index the first time is a batch process, but as new users arrive, the running time per user is almost constant. (This holds if you use a smart data structure, such as a compressed bitmap, for the posting lists of user values in the inverted index; otherwise the intersection won't be faster than O(n), AFAIK.)
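A minimal sketch of the inverted index described above, using plain Python sets as posting lists (the user data here is hypothetical and mirrors the question's example):

```python
from collections import defaultdict

# Map each user to their groups (sample data from the question).
users = {
    "User1": ["exercise", "travelling", "Praying"],
    "User2": ["exercise", "Eating", "Praying"],
    "User3": ["Shopping"],
}

# Build the inverted index: group token -> set of user IDs.
index = defaultdict(set)
for user_id, groups in users.items():
    for group in groups:
        index[group].add(user_id)

def match(groups, mode="any"):
    """Return users sharing any (OR) or all (AND) of the given groups."""
    postings = [index.get(g, set()) for g in groups]
    if not postings:
        return set()
    if mode == "any":
        return set.union(*postings)       # OR: union of posting lists
    return set.intersection(*postings)    # AND: intersection of posting lists

print(match(["exercise", "Praying"], mode="all"))  # {'User1', 'User2'}
```

In production the posting lists would live in something like Redis sets or compressed bitmaps rather than in-process dicts, but the lookup-then-intersect shape stays the same.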

Yavar answered Sep 29 '22


An Illustration:

If you have some way of obtaining a complete list of interests (perhaps you are letting them choose a specific entry from a set of interests), you can use simple matrix multiplication with a respective search vector.

Edit: This approach also works inverted, i.e. as long as you transpose properly you can map users to groups instead of groups to users. You may prefer this, since you will likely have far more users than groups, although the example is the same in principle.

Let groups = [
  1: "exercise" 
  2: "traveling"
  3: "praying"
  4: "eating"
  5: "running"
  6: "shopping"
]

Let U = [
  1 1 1 0 0 0  // user 1
  0 0 1 1 1 0  // user 2
  0 0 0 0 0 1  // user 3
]  

Use OR, since you want members of any of the groups:

Let V = [
  1  // exercise
  1  // traveling
  1  // praying
  0  // eating
  0  // running
  0  // shopping
]

Multiplying:

U · V = [
  3  // user 1 is in all 3 groups => match
  1  // user 2 is in one group => match
  0  // user 3 is in no groups => no match
]

This checks every user for the presence of one or more of the requested columns (OR); any nonzero entry in the resulting vector is a match.

Alternatively, with the same query, selecting only users matching a specific set of 2 or more columns (AND) would treat as a match any entry in the resulting vector with a value of n or greater, where n is the number of requested columns.

Selecting only those with specific columns and none of certain other columns (XOR-like) would treat as matches only those resulting entries with a value of exactly n.

Is This Really a Good Idea? This sort of approach could be useful if the real issue is that queries may become complex enough that the query analyzer becomes the bottleneck, queries become extremely hard to manage, you need to deal with an ever-changing list of 'groups', or you simply intend to do a lot of linear algebra in your application.

The solution depends foremost on your use case. For example, if query speed is of utmost importance and data transfer is less of an issue, this approach allows a very simple query that returns all the rows (with a LIMIT); you can then sift through them until you find the number of users you want for a given page, running subsequent queries only as necessary to load more pages. Since you mentioned this happens every time you receive a request from the mobile app, you might be better off caching a manageable set of users and polling that instead of the database each time, falling back to the database only when too few matches are found, using a suitable time-tested cache-replacement algorithm (which can also be partly offloaded to the client).
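The cache-first lookup suggested above might look something like this sketch, using a simple LRU eviction policy (all names here are hypothetical; in practice you would likely reach for Redis or memcached):

```python
from collections import OrderedDict

class MatchCache:
    """LRU cache for match results, keyed by the requested set of groups."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.cache = OrderedDict()  # group tuple -> list of user IDs

    def get(self, groups, fetch_from_db):
        key = tuple(sorted(groups))
        if key in self.cache:
            self.cache.move_to_end(key)      # mark as recently used
            return self.cache[key]
        users = fetch_from_db(groups)        # cache miss: hit the database
        self.cache[key] = users
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict least recently used entry
        return users
```

Repeated requests for the same group combination then skip the database entirely, which is exactly the win when the same popular interests are queried over and over.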

Conclusion/tl;dr: The important take-away is that the right structure depends entirely on the business requirements of your application. You can make the data structures as esoteric as you like in the name of performance, but this is often less fruitful than simply using a time-tested solution such as basic caching.

If you think a refactor to an inverted key approach like that suggested by Yavar will best suit your needs, that may be your solution.

If you think a graph database will fulfill your business requirements while being faster and easier to manage, that may be your solution.

If your needs are so specific that you need an entirely specific custom-built implementation that is entirely optimized to your application and not necessarily useful elsewhere, that may be your solution.

Design standards exist for good reasons, but optimization can incidentally be very domain specific. As you can see, there are several possible solutions, but choosing exactly what solution would be best for you is dependent on a lot of unknown factors such as business requirements and ultimately the correct solution would be one that is sufficiently fast without sacrificing maintainability/sanity/hair.

semchev answered Sep 29 '22