How to efficiently find similar strings in a list of unique strings in JavaScript?

Tags:

javascript

algorithm

Background: I have a list of 13,000 human names. Some of them are duplicates, and I want to find the similar ones so that I can deduplicate them manually.

For an array like:

["jeff","Jeff","mandy","king","queen"]

What would be an efficient way to get:

[["jeff","Jeff"]]

Explanation: ["jeff","Jeff"] are grouped because their Levenshtein distance is 1 (the threshold could vary, e.g. 3).

/*
Working but slow solution.
isInLevenshteinRange(a, b, max) is a helper that is not shown here; it is
assumed to return true when the Levenshtein distance of a and b is <= max.
*/
function extractSimilarNames(uniqueNames) {
  const similarNamesGroup = [];
  const grouped = new Set(); // indices already placed in a group

  for (let i = 0; i < uniqueNames.length; i++) {
    if (grouped.has(i)) continue;
    const currentName = uniqueNames[i];
    const suspiciousNames = [];

    // compare with the rest of the array
    for (let j = i + 1; j < uniqueNames.length; j++) {
      if (grouped.has(j)) continue;
      const matchingName = uniqueNames[j];
      if (isInLevenshteinRange(currentName, matchingName, 1)) {
        suspiciousNames.push(matchingName);
        grouped.add(j); // skip this name in later passes
      }
    }
    if (suspiciousNames.length > 0) {
      suspiciousNames.unshift(currentName);
      similarNamesGroup.push(suspiciousNames); // record the group
    }
  }
  return similarNamesGroup;
}

I want to find similar names by Levenshtein distance, not only by lowercase/uppercase differences.

I already found one of the fastest Levenshtein implementations, but it still takes 35 minutes to get the result for a list of 13,000 items.
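The helper isInLevenshteinRange is not shown in the question. A minimal sketch of what such a helper might look like is below; it assumes a case-sensitive comparison and the plain insert/delete/substitute distance, and it stops early once the distance can no longer fit within the limit. The name and signature are taken from the call in the code above; the body is only an illustration, not the asker's implementation.

```javascript
// Hypothetical helper: true when the Levenshtein distance between a and b
// is at most maxDistance. Comparison is case-sensitive.
function isInLevenshteinRange(a, b, maxDistance) {
  // The distance is never smaller than the difference in length.
  if (Math.abs(a.length - b.length) > maxDistance) return false;

  // Classic dynamic programming, keeping only the previous row.
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    let rowMin = i;
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution or match
      );
      if (curr[j] < rowMin) rowMin = curr[j];
    }
    // No later row can drop below the minimum of this row.
    if (rowMin > maxDistance) return false;
    prev = curr;
  }
  return prev[b.length] <= maxDistance;
}
```

With this sketch, isInLevenshteinRange("jeff", "Jeff", 1) is true and isInLevenshteinRange("jeff", "mandy", 1) is false. The early exit only reduces the cost of each comparison; the number of comparisons stays the same.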


asked Apr 23 '19 04:04

3 Answers

Your problem is not the speed of the Levenshtein distance implementation. Your problem is that you have to compare each word with every other word. That is on the order of 13000² comparisons (about 84 million if each pair is checked only once), and each one computes a Levenshtein distance.

So my approach would be to try to reduce the number of comparisons.

Here are some ideas:

Words are only similar if their lengths differ by less than 20% (just my estimate).
→ We can group by length and only compare words with other words of length ±20%.
Words are only similar if they share a lot of letters.
→ We can build an index of, e.g., 3-grams (all lower case) that maps each 3-gram to the words containing it.
→ Only compare (e.g. with Levenshtein distance) a word with other words that have several 3-grams in common with it.
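The two ideas above could be combined roughly as follows. This is only a sketch: the function names trigrams and candidatePairs, and the thresholds minShared and maxLengthRatio, are made up for illustration and are not tuned values. It returns pairs of indices that are worth passing to a real distance function.

```javascript
// Lowercase 3-grams of a word. Words shorter than 3 characters are kept
// whole so that they still produce one key.
function trigrams(word) {
  const w = word.toLowerCase();
  if (w.length < 3) return new Set([w]);
  const grams = new Set();
  for (let i = 0; i + 3 <= w.length; i++) grams.add(w.slice(i, i + 3));
  return grams;
}

// Returns index pairs [i, j] with i < j that share at least minShared
// 3-grams and whose lengths differ by at most maxLengthRatio of the longer.
function candidatePairs(words, minShared = 2, maxLengthRatio = 0.2) {
  const index = new Map(); // 3-gram -> indices of earlier words containing it
  const pairs = [];
  words.forEach((word, j) => {
    const shared = new Map(); // earlier word index -> shared 3-gram count
    for (const gram of trigrams(word)) {
      const bucket = index.get(gram) || [];
      for (const i of bucket) shared.set(i, (shared.get(i) || 0) + 1);
      bucket.push(j);
      index.set(gram, bucket);
    }
    for (const [i, count] of shared) {
      const longer = Math.max(words[i].length, word.length);
      const diff = Math.abs(words[i].length - word.length);
      if (count >= minShared && diff <= maxLengthRatio * longer) {
        pairs.push([i, j]);
      }
    }
  });
  return pairs;
}
```

For ["jeff","Jeff","mandy","king","queen"] this yields the single candidate pair [0, 1]. Note that this is a heuristic filter: short names that differ in the middle (such as "king" and "kind") share only one 3-gram, so a stricter minShared trades missed pairs for fewer comparisons.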


answered Nov 13 '22 12:11

MrSmith42

Approaches to remove similar names:

Use a phonetic representation of the words, e.g. cmudict, which works with Python's NLTK. You can then find which names are phonetically close to each other.
Try different forms of stemming or simplification. I would try the most aggressive stemmers, such as the Porter stemmer.
Levenshtein trie. You can build a trie data structure that helps find the word with the minimum distance to the searched item; this is used for full-text search in some search engines. As far as I know it is already implemented in Java. In your case you search for one item and then add it to the structure at every step, making sure the item you search for is not in the structure yet.
Manual naive approach. Find all suitable representations of every word/name, put all representations into a map, and find the representations that have more than one word. With around 15 different representations per word, 13K names need only about 195K iterations (13K × 15) to build this map, which is much cheaper than comparing each word with every other one (around 80M comparisons with 13K names).
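The last approach could be sketched in JavaScript like this. The three representations used here (letters only, repeated letters collapsed, a vowel-free skeleton) are hypothetical stand-ins chosen for illustration; phonetic codes or stems would slot into the same place. The sketch also assumes ASCII names, since other characters are stripped.

```javascript
// Hypothetical representations of a name. Each one maps several spellings
// to the same normalized key; the prefix keeps the kinds separate.
function representations(name) {
  const letters = name.toLowerCase().replace(/[^a-z]/g, ""); // ASCII only
  return new Set([
    "exact:" + letters,
    // collapse repeated letters: "adell" -> "adel"
    "single:" + letters.replace(/(.)\1+/g, "$1"),
    // keep the first letter, drop later vowels: "adele" -> "adl"
    "skeleton:" + letters.charAt(0) + letters.slice(1).replace(/[aeiouy]/g, ""),
  ]);
}

// Put every representation into a map and keep the ones shared by more
// than one distinct name. Identical groups are reported once.
function groupByRepresentation(names) {
  const buckets = new Map(); // representation -> Set of names
  for (const name of names) {
    for (const key of representations(name)) {
      if (!buckets.has(key)) buckets.set(key, new Set());
      buckets.get(key).add(name);
    }
  }
  const groups = new Map(); // serialized group -> group
  for (const set of buckets.values()) {
    if (set.size > 1) {
      const group = [...set].sort();
      groups.set(JSON.stringify(group), group);
    }
  }
  return [...groups.values()];
}
```

For ["jeff","Jeff","mandy","king","queen"] this returns [["Jeff","jeff"]]. The work grows with the number of names times the number of representations, not with the number of pairs.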

-- Edit --

If there is a choice, I would use something like Python or Java instead of JS for this. This is only my opinion, based on the following: I don't know all the requirements, Java/Python are commonly used for natural language processing, and the task looks more like heavy data processing than front-end work.

answered Nov 13 '22 12:11

varela

Since your working code uses a Levenshtein distance of 1 only, I will assume no other distances need to be found.

I will propose a solution similar to the one Jonas Wilms posted, with these differences:

No need to call an isLevenshtein function
Produces only unique pairs
Each pair is lexically ordered

// Sample data with lots of similar names
const names = ["Adela","Adelaida","Adelaide","Adele","Adelia","AdeLina","Adeline",
               "Adell","AdellA","Adelle","Ardelia","Ardell","Ardella","Ardelle",
               "Ardis","Madeline","Odelia","ODELL","Odessa","Odette"];

const map = Object.create(null); // no inherited keys such as "constructor"
const pairs = new Set;
for (const name of names) {
    for (const i in name+"_") { // Additional iteration to NOT delete a character
        const key = (name.slice(0, i) + name.slice(+i + 1, name.length)).toLowerCase();
        // Group words together where the removal from the same index leads to the same key
        if (!map[key]) map[key] = Array.from({length: key.length+1}, () => new Set);
        // If NO character was removed, put the word in EACH group
        for (const set of (+i < name.length ? [map[key][i]] : map[key])) {
            if (set.has(name)) continue;
            for (let similar of set) pairs.add(JSON.stringify([similar, name].sort()));
            set.add(name);
        }
    }
}
const result = [...pairs].sort().map(JSON.parse); // sort is optional
console.log(result);

I tested this on a set of 13000 names, including at least 4000 different names, and it produced 8000 pairs in about 0.3 seconds.
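To see why grouping by deletion keys finds names at distance 1, it can help to print the keys for a few names. The helper below is a hypothetical illustration (deletionKeys is not part of the code above); it lists, for each index, the lowercase key produced by deleting the character at that index, plus a final entry with nothing deleted.

```javascript
// For each index i, the lowercase key obtained by deleting the character
// at i. The last entry (i === name.length) is the name with nothing deleted.
function deletionKeys(name) {
  const keys = [];
  for (let i = 0; i <= name.length; i++) {
    keys.push((name.slice(0, i) + name.slice(i + 1)).toLowerCase());
  }
  return keys;
}

console.log(deletionKeys("Adele"));
// ["dele", "aele", "adle", "adee", "adel", "adele"]
console.log(deletionKeys("Adela")[4]);  // "adel", same key and index as "Adele"
console.log(deletionKeys("Adelle")[3]); // "adele", the undeleted key of "Adele"
```

Two names that differ by one substituted character produce the same key at the same index ("Adele" and "Adela" at index 4). Two names that differ by one inserted character meet where the longer name's deletion key equals the shorter name's undeleted key ("Adelle" and "Adele"). Because the keys are lowercased, names that differ only in case are paired as well.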

answered Nov 13 '22 14:11

trincot


Jeff Chung
