 

Find the most similar text in a list of strings

First of all, I want to acknowledge that this is perhaps a very sophisticated problem; however, I have not been able to find a definitive answer for it online, so I'm looking for suggestions.

Suppose I want to maintain a list containing over a hundred thousand strings, where each string is a sentence that a user has typed. A value is added to the list as soon as a user types a new message. For example:

["Hello world!", "Good morning, my name is John", "Good morning, everyone"]

But I also want each string to have a timeout, so that if it is not repeated within 5 minutes it is removed. So I change the list to the following format:

[{message:"Hello world!", timeout: NodeJS.Timeout, count: 1}, {message:"Good morning, my name is John", timeout: NodeJS.Timeout, count: 1}, {message:"Good morning, everyone", timeout: NodeJS.Timeout, count: 1}]

Now suppose a user types the following message:

"Good morning, everyBODY"

I want to compare this string to all the messages in the list and, if one of them is 70% or more similar, update the count of that message; otherwise, insert it as a new message. For this example, the application should update the count of "Good morning, everyone" to 2.

Since users can type a lot of messages in a short amount of time, the algorithm must also support fast insertion, searching, and deletion after the timeout.

What is the best way to implement this? Or are there any libraries that could help me with this?

NOTE: The strings do not need to be in an array, any data structure would work.

The main purpose of this algorithm is to detect similar messages when the count reaches a predefined value, for example: warning: Over 5 users typed messages similar to "Hello everybody" within 5 minutes.

I have looked at B-trees, nearest neighbor search, etc., but I can't figure out what would be the best solution.

Update: I plan on using Levenshtein distance for string similarity; however, the main problem is how to apply that to a list of strings in the most time-efficient way, without having to check every single string every time a new message is added.

Aiden asked Sep 20 '20


1 Answer

Levenshtein distance

Unlike the other answer, I think Levenshtein distance is perfectly capable of dealing with spelling mistakes. Indeed, Levenshtein and LevXenshtein have a Levenshtein distance of only 1, and can thus likely be concluded to be the same message.

However, if you want to use this distance, you will have to compute the distance between the new message and every message stored, every time a new message comes in. There is likely no way around this.

Unfortunately, there is no really useful pre-processing you can do for this.
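To make that linear scan concrete, here is a rough TypeScript sketch. It assumes the stored messages are available as some iterable of { message } objects, and it normalizes similarity as 1 - distance / max(length); both that normalization and the helper names (levenshtein, similarityRatio, findMostSimilar) are assumptions, not requirements from the question.

function levenshtein(a: string, b: string): number {
  // Classic dynamic-programming edit distance, O(|a| * |b|).
  const rows = a.length + 1;
  const cols = b.length + 1;
  const dp: number[][] = Array.from({ length: rows }, () => new Array<number>(cols).fill(0));
  for (let i = 0; i < rows; i++) dp[i][0] = i;
  for (let j = 0; j < cols; j++) dp[0][j] = j;
  for (let i = 1; i < rows; i++) {
    for (let j = 1; j < cols; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,        // deletion
        dp[i][j - 1] + 1,        // insertion
        dp[i - 1][j - 1] + cost, // substitution
      );
    }
  }
  return dp[rows - 1][cols - 1];
}

function similarityRatio(a: string, b: string): number {
  // You may want to lower-case both strings first so that "everyBODY"
  // and "everybody" compare as equal; that is a choice, not a given.
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : 1 - levenshtein(a, b) / longest;
}

// Linear scan: O(n) distance computations per incoming message.
function findMostSimilar(
  incoming: string,
  stored: Iterable<{ message: string }>,
): { message: string; score: number } | undefined {
  let best: { message: string; score: number } | undefined;
  for (const entry of stored) {
    const score = similarityRatio(incoming, entry.message);
    if (!best || score > best.score) {
      best = { message: entry.message, score };
    }
  }
  return best;
}

// Illustrative usage against the entries map from the question:
// const best = findMostSimilar("Good morning, everyBODY", entries.values());
// if (best && best.score >= 0.7) { /* update count, reset timeout */ } else { /* insert */ }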

Other possibilities

If you can find a way to map every message to a fixed-size vector, you can use essentially any nearest neighbor search technique. I suggest doing so.

This leaves us with two problems to solve. Generating the fixed-length vector, and doing the search.

Fixed-size vector representation

There are multiple ways of doing this, all with their own drawbacks. I'll mention two specifically, but which method is best for you will depend on your architecture and data.

First, you could go the machine-learning way. You could map every word to a pre-trained vector with fastText, average the word vectors in the message, and use that as your message vector. The drawbacks of this method are that it ignores word order, and it works less well if the words used tend to be very informal. If your messages have a culture of their own (such as Twitch chat, for example), you would have to retrain these vectors instead of using pre-trained ones.
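As a rough illustration of the averaging idea (not a specific fastText API), assume the per-word vectors have already been loaded into a plain Map<string, number[]>, for example by parsing a fastText .vec export; the DIMENSIONS constant and wordVectors name are placeholders.

const DIMENSIONS = 300; // typical fastText dimensionality; adjust to your model
const wordVectors = new Map<string, number[]>(); // filled elsewhere from a .vec file

function messageVector(message: string): number[] {
  const result = new Array<number>(DIMENSIONS).fill(0);
  const words = message.toLowerCase().split(/\s+/).filter(Boolean);
  let found = 0;

  for (const word of words) {
    const vec = wordVectors.get(word);
    if (!vec) continue; // out-of-vocabulary words are simply skipped here
    for (let i = 0; i < DIMENSIONS; i++) result[i] += vec[i];
    found++;
  }

  if (found > 0) {
    for (let i = 0; i < DIMENSIONS; i++) result[i] /= found;
  }
  return result; // note: word order is lost, as discussed above
}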

Alternatively, you could use the structure of the text directly, and make an occurrence vector of bigrams. That is, jot down how often every 2-character combination occurs in a message. This is fairly robust, but has the drawback that the vectors will become relatively large.
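For example, a character-bigram count can be kept as a sparse map instead of a dense vector; the cosine similarity below is just one possible way to compare two such vectors.

// Count how often every 2-character combination occurs in a message.
function bigramCounts(message: string): Map<string, number> {
  const counts = new Map<string, number>();
  const text = message.toLowerCase();
  for (let i = 0; i < text.length - 1; i++) {
    const bigram = text.slice(i, i + 2);
    counts.set(bigram, (counts.get(bigram) ?? 0) + 1);
  }
  return counts;
}

// Cosine similarity between two sparse bigram vectors, as one possible metric.
function cosineSimilarity(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0, normA = 0, normB = 0;
  for (const [bigram, countA] of a) {
    normA += countA * countA;
    const countB = b.get(bigram);
    if (countB !== undefined) dot += countA * countB;
  }
  for (const countB of b.values()) normB += countB * countB;
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}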

Regardless, these are just two options, and it's impossible to tell what method is ideal for you. Unless of course someone has a brilliant idea.

Nearest neighbor search

Given that we have fixed-length vectors, we can now do nearest neighbor search. As you've probably found, there are once again many different methods for this, all with their own drawbacks. Exhausting, I know.

I'll choose to discuss three categories.

  1. Approximate search: This method may seem a little silly, but it could be what you want. Specifically, locality-sensitive hashing essentially means building a hashing function where "similar" vectors are likely to end up in the same bucket. You could then do anything you want, such as Levenshtein, with the other members of the bucket, because there should not be too many of them (a minimal sketch of this idea follows after this list). The advantage of such an approximate algorithm is that it can be fast, and with some smart hashing you don't even need fixed-length vectors. A downside, of course, is that it is not guaranteed to work.

  2. Exact search: We can also choose to instead solve the problem of Fixed-radius near neighbors. That is, find the points within some distance of the target point. You could do this by mapping vectors to integers (if they aren't already) and simply checking every lattice point within the distance you want to search. The primary drawback here is that the search time grows very fast not with the number of points, but with the number of dimensions of the vector. This method would necessitate small vectors.

  3. Fancy data structures: This seems to me most likely to be the right solution. Unfortunately, there are a lot of letter-trees. You mention B-trees, but there are also R-trees, R+-trees, R*-trees, X-trees, and that's just the direct descendants of the R-tree. At the risk of missing the trees for the forest, I'd suggest taking a look at the k-d tree. It can do nearest neighbor search in logarithmic time, as well as insertion and deletion.
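To give a flavour of option 1 above, here is a random-hyperplane locality-sensitive hashing sketch for fixed-length vectors. The number of hyperplanes, the bucket layout and all names are arbitrary choices, and real setups often use several hash tables to reduce the chance of missing a near neighbour.

const NUM_HYPERPLANES = 16; // more planes => smaller, more precise buckets
const DIMENSIONS = 300;     // must match your message-vector dimensionality

// One random hyperplane per bit of the hash key.
const hyperplanes: number[][] = Array.from({ length: NUM_HYPERPLANES }, () =>
  Array.from({ length: DIMENSIONS }, () => Math.random() * 2 - 1),
);

// Similar vectors tend to fall on the same side of each random hyperplane,
// so they tend to produce the same key and land in the same bucket.
function lshKey(vector: number[]): string {
  let key = "";
  for (const plane of hyperplanes) {
    let dot = 0;
    for (let i = 0; i < DIMENSIONS; i++) dot += plane[i] * vector[i];
    key += dot >= 0 ? "1" : "0";
  }
  return key;
}

// Buckets of candidate messages; only candidates in the same bucket need the
// expensive comparison (e.g. Levenshtein against the incoming message).
const buckets = new Map<string, string[]>();

function addToIndex(message: string, vector: number[]): void {
  const key = lshKey(vector);
  const bucket = buckets.get(key);
  if (bucket) bucket.push(message);
  else buckets.set(key, [message]);
}

function candidates(vector: number[]): string[] {
  return buckets.get(lshKey(vector)) ?? [];
}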

ADdV answered Sep 22 '22