Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Create array of unique combinations from array of strings

I'm writing something that takes a block of text and breaks it down into possible database queries that could be used to find similar blocks of text. (something similar to the "similar questions" list being generated while I type this) The basic process:

  1. Remove stop words from text
  2. Remove special characters
  3. From remaining text create an array of unique "stems"
  4. Create an array of possible combinations of the array of stems (where I'm stuck... kind of)

Here's what I have so far:

    //baseList starts with an empty array
    //candList starts with the array of unique stems
    //target is where the arrays of unique combinations are stored

    function createUniqueCombos(baseList,candList,target){

    for(var i=0;i<candList.length;i++){         

        //copy the base List
        var newList = baseList.slice(0);

        //add the candidate list item to the base list copy
        newList.push(candList[i]);

        //add the new array to the target array
        target.push(newList);   

        //re-call function using new array as baseList
        //and remaining candidates as candList
        var nextCandList = candList.slice(i + 1);       
        createUniqueCombos(newList,nextCandList,target);
    }

}

This works, but on blocks of text larger than 25 words or so, it crashes my browser. I realize that mathematically there could be a huge number of possible combinations. What I'd like to know is:

  1. Is there a more efficient way to do this?
  2. How could I define a min/max combination array length?
like image 501
HartyeTech Avatar asked Jul 10 '12 13:07

HartyeTech


1 Answers

I think your logic is fundamentally flawed because of how many combinations you're creating.

An approach I'd take would be;

  1. Split the text into individual words (we'll call this variable split_words)
  2. Remove special characters
  3. Remove short/ common words (and, or, I, a); either do this by length, or more intelligently by a black list of words
  4. Have a table (e.g. blocks) which has columns block_id and word
  5. Have a SQL query such as

    SELECT block_id FROM blocks 
    WHERE word IN (split_words) GROUP BY block_id 
    ORDER BY COUNT(*) DESC
    

and then you'll have a list of block_ids which are ordered depending on how many words in common the blocks have.

like image 134
Matt Avatar answered Oct 01 '22 01:10

Matt