Consider a free-form text box that records answers to the question "What do you want to do before you die?" Given a collection of response strings (max length 240), I'd like to sort, group, and count them by idea (which may amount to plain string similarity, as described in this question).

1. Is there another or better way to do something like this?
2. Is this any different from string similarity?
3. Is this the right question to be asking?

The idea is to have people write in the text box over and over again, and for me to report a number saying, roughly, that 802 people wrote approximately the same thing.

Algorithm to compare similarity of ideas (as strings)

2 Answers

It is much more difficult than string similarity. This is what you need to do at a minimum:

- Perform some text cleaning, such as removing punctuation characters and common "stop words".
- Construct a corpus (a collection of words with their usage statistics) from the terms that occur in the answers.
- Calculate a weight for every term (TF-IDF is a common choice).
- Construct a document vector from every answer (each term corresponds to a dimension in a very high-dimensional Euclidean space).
- Run a clustering algorithm on the document vectors.

Read a good statistical natural language processing book, or search Google for good introductions and tutorials (likely terms: statistical NLP, text categorization, clustering). You can probably find libraries (Weka or NLTK come to mind) depending on your language of choice, but you need to understand the concepts to use a library anyway.
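
The steps listed above can be sketched in plain Python as follows. This is a minimal illustration, not a recommended implementation: the stop-word list is a tiny made-up sample, the weighting is a smoothed TF-IDF chosen for the example, the clustering is a simple greedy single-pass scheme rather than a standard algorithm such as k-means, and the function names and the `threshold` value are all assumptions.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "i", "in", "like", "my", "of", "the", "to", "want", "would"}


def tokenize(text):
    """Lowercase the text, drop punctuation, and remove stop words."""
    return [w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOP_WORDS]


def tfidf_vectors(answers):
    """Return one unit-length sparse TF-IDF vector (a dict) per answer."""
    token_lists = [tokenize(a) for a in answers]
    n = len(token_lists)
    # Document frequency: in how many answers does each term occur?
    df = Counter(t for tokens in token_lists for t in set(tokens))
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        # Smoothed IDF so that terms occurring in every answer keep a nonzero weight.
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        norm = math.sqrt(sum(x * x for x in vec.values()))
        vectors.append({t: x / norm for t, x in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors (their dot product)."""
    if len(u) > len(v):
        u, v = v, u
    return sum(x * v.get(t, 0.0) for t, x in u.items())


def cluster(answers, threshold=0.3):
    """Greedy single-pass clustering.

    Each answer joins the first cluster whose first member is at least
    `threshold` similar to it; otherwise it starts a new cluster.
    Returns a list of clusters, each a list of indices into `answers`.
    """
    vectors = tfidf_vectors(answers)
    clusters = []
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters
```

Calling `cluster(answers)` on the list of response strings and taking `len(c)` for each returned cluster gives the kind of "N people wrote roughly the same thing" count the question asks for. The result depends heavily on `threshold` and on the order of the answers, which is one reason a proper clustering algorithm from a library is usually preferable.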

answered Sep 23 '22 15:09

Ali Ferhat

Latent Semantic Analysis (LSA) might interest you. Here is a nice introduction.

Latent semantic analysis (LSA) is a technique in natural language processing, in particular in vectorial semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. [...]
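
To make the idea concrete, here is a rough, dependency-free sketch of LSA. It builds a raw term-document count matrix X, then obtains the top k "concepts" from the eigenvectors of the document Gram matrix X^T X, which is mathematically a truncated SVD of X. Several things here are assumptions for illustration: the function names, the choice of k=2, the use of raw counts instead of weighted terms, and the power-iteration solver, which in practice would be replaced by a library SVD routine.

```python
import math
import random


def term_document_counts(docs):
    """Return (vocab, X) where X[i][j] is the count of term i in document j."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({w for tokens in tokenized for w in tokens})
    row = {w: i for i, w in enumerate(vocab)}
    X = [[0.0] * len(docs) for _ in vocab]
    for j, tokens in enumerate(tokenized):
        for w in tokens:
            X[row[w]][j] += 1.0
    return vocab, X


def top_eigenpairs(G, k, iterations=500, seed=0):
    """Top-k eigenpairs of a symmetric PSD matrix via power iteration with deflation."""
    n = len(G)
    G = [r[:] for r in G]  # work on a copy, since deflation modifies the matrix
    rng = random.Random(seed)
    pairs = []
    for _ in range(min(k, n)):
        v = [rng.random() + 0.1 for _ in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iterations):
            w = [sum(G[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # nothing left to extract
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(G[i][j] * v[j] for j in range(n)) for i in range(n))
        pairs.append((lam, v))
        # Deflate: remove the component just found so the next pass finds the next one.
        for i in range(n):
            for j in range(n):
                G[i][j] -= lam * v[i] * v[j]
    return pairs


def lsa_document_vectors(docs, k=2):
    """Map each document to a k-dimensional vector in the latent concept space."""
    _, X = term_document_counts(docs)
    n = len(docs)
    # Gram matrix G = X^T X; its eigenvalues are the squared singular values of X.
    G = [[sum(r[a] * r[b] for r in X) for b in range(n)] for a in range(n)]
    pairs = top_eigenpairs(G, k)
    return [[math.sqrt(max(lam, 0.0)) * v[d] for lam, v in pairs] for d in range(n)]


def cosine_similarity(a, b):
    """Cosine similarity of two dense vectors; 0.0 if either is all zeros."""
    denom = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / denom if denom else 0.0
```

The point of the reduced space is that two answers can end up close together even when they share no words, as long as they are linked through co-occurring terms in other answers. For example, given the documents "travel the world", "see the world", and "travel abroad", the last two share no term but are both connected to the first, so their vectors from `lsa_document_vectors` point in nearly the same direction.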

answered Sep 21 '22 15:09

Franck Dernoncourt

Tags: algorithm, artificial-intelligence, nlp

Kristian
