How to improve performance of matching algorithm

I am writing an algorithm that matches users on the basis of interests and location. Suppose I have this data for users:

{
    "users": [{
            "location": "Delhi, India",
            "interests": ["Jogging", "Travelling", "Praying"],
            "groups": ["exercise", "travelling", "Praying"]
        },
        {
            "location": "Delhi, India",
            "interests": ["Running", "Eating", "Praying"],
            "groups": ["exercise", "Eating", "Praying"]
        }, {
            "location": "Delhi, India",
            "interests": ["Shopping"],
            "groups": ["Shopping"]
        }
    ]
}

Here user1 and user2 share the groups "exercise" and "Praying", while user1 and user3 have no groups in common.

Finding people with similar interests in a database of 10+ million users can hurt database performance if I run a SQL query with a WHERE clause every time a request arrives from the mobile app:

SELECT * FROM users WHERE groups = 'exercise' OR groups = 'travelling' OR groups = 'Praying';

This will scan every profile, which will hurt my application's performance. I do not want to use this approach, as it will not scale. What algorithm should I use to get high performance?

N Sharma asked Apr 24 '17


2 Answers

You can construct an inverted index where the key is one of the tokens in 'groups' (i.e. exercise, travelling, etc.) and the value is the list of users that fall under that group. For example, your inverted index would look something like this:

Key: ListOfValues
Exercise: User1 -> User2
Praying: User1 -> User2
Travelling: User1 -> User3 -> User8 -> User14
Shopping: User3

Whether you want a tree-based, bitmap-based, or hash-table-based inverted index is your choice, depending on your space/time trade-offs.

Now when you get a new user, say User99 with groups (Exercise and Praying), you can quickly retrieve the values (i.e. the users) for the 'Exercise' token, then retrieve the values for the 'Praying' token, and finally take the intersection ('AND') of the two.

Note that building the index the first time is a batch process, but as new users arrive, the running time per user is almost constant. (This holds if you use a smart data structure, such as a compressed bitmap, for the posting lists of user values in the inverted index; otherwise the intersection won't be faster than O(n), AFAIK.)
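A minimal sketch of the inverted index described above, using plain Python sets as posting lists (the user data here is hypothetical and mirrors the question's example):

```python
from collections import defaultdict

# Map each user to their groups (sample data from the question).
users = {
    "User1": ["exercise", "travelling", "Praying"],
    "User2": ["exercise", "Eating", "Praying"],
    "User3": ["Shopping"],
}

# Build the inverted index: group token -> set of user IDs.
index = defaultdict(set)
for user_id, groups in users.items():
    for group in groups:
        index[group].add(user_id)

def match(groups, mode="any"):
    """Return users sharing any (OR) or all (AND) of the given groups."""
    postings = [index.get(g, set()) for g in groups]
    if not postings:
        return set()
    if mode == "any":
        return set.union(*postings)       # OR: union of posting lists
    return set.intersection(*postings)    # AND: intersection of posting lists

print(match(["exercise", "Praying"], mode="all"))  # {'User1', 'User2'}
```

In production the posting lists would live in something like Redis sets or compressed bitmaps rather than in-process dicts, but the lookup-then-intersect shape stays the same.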

Yavar answered Sep 29 '22


An Illustration:

If you have some way of obtaining a complete list of interests (perhaps you are letting them choose a specific entry from a set of interests), you can use simple matrix multiplication with a respective search vector.

Edit: This approach also works inverted, i.e. as long as you transpose properly you can map users to groups instead of groups to users. You may prefer this, since you will likely have far more users than groups, although the example is the same in principle.

Let groups = [
  1: "exercise" 
  2: "traveling"
  3: "praying"
  4: "eating"
  5: "running"
  6: "shopping"
]

Let U = [
  1 1 1 0 0 0  // user 1
  0 0 1 1 1 0  // user 2
  0 0 0 0 0 1  // user 3
]  

Use OR, since you want members of any of the groups:

Let V = [
  1  // exercise
  1  // traveling
  1  // praying
  0  // eating
  0  // running
  0  // shopping
]

Multiplying:

U · V = [
  3  // user 1 is in all 3 groups => match
  1  // user 2 is in one group => match
  0  // user 3 is in no groups => no match
]

This checks every user for the presence of one or more of the requested columns (OR); any nonzero entry in the resulting vector is a match.

Alternatively, with the same query, selecting only users matching a specific set of 2 or more columns (AND) would treat as a match any entry in the resulting vector with a value of n or greater, where n is the number of requested columns.

Selecting only those with specific columns and none of certain other columns (XOR-like) would treat as matches only those resulting entries with a value of exactly n.

Is This Really a Good Idea? This sort of approach could be useful if the real issue is that queries may become complex enough that the query analyzer becomes the bottleneck, queries become extremely hard to manage, you need to deal with an ever-changing list of 'groups', or you simply intend to do a lot of linear algebra in your application.

The solution depends foremost on your use case. For example, if query speed is of utmost importance and data transfer is less of an issue, this approach allows a very simple query that returns all the rows (with a LIMIT); you can then sift through them until you find the number of users you want for a given page, running subsequent queries only as necessary to load more pages. Since you mentioned this happens every time you receive a request from the mobile app, you might be better off caching a manageable set of users and polling that instead of the database each time, falling back to the database only when too few matches are found, using a suitable time-tested cache-replacement algorithm (which can also be partly offloaded to the client).
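The cache-first lookup suggested above might look something like this sketch, using a simple LRU eviction policy (all names here are hypothetical; in practice you would likely reach for Redis or memcached):

```python
from collections import OrderedDict

class MatchCache:
    """LRU cache for match results, keyed by the requested set of groups."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.cache = OrderedDict()  # group tuple -> list of user IDs

    def get(self, groups, fetch_from_db):
        key = tuple(sorted(groups))
        if key in self.cache:
            self.cache.move_to_end(key)      # mark as recently used
            return self.cache[key]
        users = fetch_from_db(groups)        # cache miss: hit the database
        self.cache[key] = users
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict least recently used entry
        return users
```

Repeated requests for the same group combination then skip the database entirely, which is exactly the win when the same popular interests are queried over and over.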

Conclusion/tl;dr: The important take-away is that the right structure depends entirely on the business requirements of your application. You can make the data structures as esoteric as you like in the name of performance, but this is often less fruitful than simply using a time-tested solution such as basic caching.

If you think a refactor to an inverted key approach like that suggested by Yavar will best suit your needs, that may be your solution.

If you think a graph database will fulfill your business requirements while being faster and easier to manage, that may be your solution.

If your needs are so specific that you need an entirely specific custom-built implementation that is entirely optimized to your application and not necessarily useful elsewhere, that may be your solution.

Design standards exist for good reasons, but optimization can incidentally be very domain specific. As you can see, there are several possible solutions, but choosing exactly what solution would be best for you is dependent on a lot of unknown factors such as business requirements and ultimately the correct solution would be one that is sufficiently fast without sacrificing maintainability/sanity/hair.

semchev answered Sep 29 '22