Algorithm to find related words in a text

I would like to take a word (e.g. "Apple") and process a text (or maybe several texts) to come up with related terms. For example: process a document for "Apple" and find that iPod, iPhone, and Mac are terms related to "Apple".

Any idea on how to solve this?

789

asked Sep 25 '11 07:09

Andrew

3 Answers

As a starting point: your question relates to text mining.

There are two ways: a statistical approach, and one from natural language processing (NLP).

I do not know much about NLP, but I can say something about the statistical approach:

You need some vector space representation of your documents; see:
http://en.wikipedia.org/wiki/Vector_space_model
http://en.wikipedia.org/wiki/Document-term_matrix
http://en.wikipedia.org/wiki/Tf%E2%80%93idf

To learn semantics (that is, that different words can mean the same thing, or that one word can have different meanings), you need a large text corpus to learn from. As I said, this is a statistical approach, so you need lots of samples. Some test collections are listed at http://www.daviddlewis.com/resources/testcollections/
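
To make the vector space idea concrete, here is a minimal sketch in plain Python. It is only an illustration under assumptions of mine: the toy documents, the tiny stop-word list, and the function names are all made up, and the weighting is the basic tf * log(N/df) form of tf-idf. Each term becomes a row of the term-document matrix, and terms whose rows point in a similar direction (cosine similarity) are treated as related.

```python
import math
from collections import Counter

# Hypothetical stop-word list and toy corpus, for illustration only.
STOP_WORDS = {"a", "and", "are", "by", "from", "is", "the"}
DOCS = [
    "Apple released a new iPhone and a new Mac",
    "The iPod and the iPhone are made by Apple",
    "Apple sells the Mac and the iPod",
    "Bananas and oranges are fruit",
    "Orange juice is made from oranges",
]

def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    words = (w.strip(".,;:!?\"'()") for w in text.lower().split())
    return [w for w in words if w and w not in STOP_WORDS]

def tfidf_term_vectors(docs):
    """Return {term: [tf-idf weight in doc 0, doc 1, ...]}."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    vectors = {}
    for term in vocab:
        df = sum(1 for c in counts if term in c)   # document frequency
        idf = math.log(len(docs) / df)             # 0 if the term is in every doc
        vectors[term] = [c[term] * idf for c in counts]
    return vectors

def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def related_terms(word, docs, top_n=3):
    """Terms whose document profile is most similar to that of `word`."""
    vectors = tfidf_term_vectors(docs)
    target = vectors[word]
    scores = [(cosine(target, vec), term)
              for term, vec in vectors.items() if term != word]
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [term for _, term in scores[:top_n]]

print(related_terms("apple", DOCS))
```

On this toy corpus the three best matches for "apple" are "iphone", "ipod" and "mac" (in some order, since their scores tie). Without the stop-word filter, words like "and" would score just as high, which is one reason a real corpus needs proper preprocessing.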

Maybe you already have lots of documents from the domain you are going to work in. That is the best situation.
You then have to retrieve latent factors from this corpus. The most common methods are:
- LSA (http://en.wikipedia.org/wiki/Latent_semantic_analysis)
- PLSA (http://en.wikipedia.org/wiki/Probabilistic_latent_semantic_analysis)
- non-negative matrix factorization (http://en.wikipedia.org/wiki/Non-negative_matrix_factorization)
- latent Dirichlet allocation (http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)
These methods involve lots of math. Either you dig into it yourself, or you find good libraries.
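
As a rough sketch of the first method in the list (LSA), the following uses NumPy's SVD on a raw term-document count matrix. The toy corpus, the choice of k=2 latent factors, and the function names are assumptions for illustration; a real setup would use a large corpus, tf-idf weighting, and far more factors.

```python
import numpy as np

# Hypothetical toy corpus: three "tech" documents and three "fruit" documents.
DOCS = [
    "apple iphone mac",
    "apple ipod iphone",
    "apple mac ipod",
    "banana orange juice",
    "banana fruit orange",
    "banana juice fruit",
]

def lsa_term_vectors(docs, k=2):
    """Map each term to a k-dimensional latent vector via truncated SVD."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    # Term-document count matrix: rows are terms, columns are documents.
    matrix = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.lower().split():
            matrix[index[w], j] += 1
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    # Keep the k strongest latent factors, scaled by their singular values.
    return vocab, u[:, :k] * s[:k]

def related_terms(word, docs, k=2, top_n=3):
    """Rank terms by cosine similarity to `word` in the latent space."""
    vocab, vecs = lsa_term_vectors(docs, k)
    norms = np.linalg.norm(vecs, axis=1)
    norms[norms == 0] = 1.0                  # avoid division by zero
    unit = vecs / norms[:, None]
    sims = unit @ unit[vocab.index(word)]
    order = np.argsort(-sims)                # most similar first
    return [vocab[i] for i in order if vocab[i] != word][:top_n]

print(related_terms("apple", DOCS))
```

With two latent factors, the terms from the "tech" documents end up pointing in one direction and the "fruit" terms in another, so "iphone", "ipod" and "mac" come out as the terms closest to "apple".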

I can recommend the following books:

http://www.oreilly.de/catalog/9780596529321/toc.html
http://www.oreilly.de/catalog/9780596516499/index.html

147

answered Nov 13 '22 12:11

rocksportrocker

Like all of AI, it's a very difficult problem. You should look into natural language processing to learn about some of the issues.

One very, very simplistic approach is to build a 2D table of words that stores, for each pair of words, the average distance (in words) between their occurrences in the text. Obviously you'll need to limit the maximum distance considered, and possibly the number of words as well. Then, after processing a lot of text, you'll have an indicator of how often certain words appear in the same context.
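
A minimal sketch of such a table, assuming the text is already split into tokens. The names and the `max_distance` parameter are made up for illustration, and the table also keeps a co-occurrence count alongside the average distance, since the average alone says nothing about how often a pair was seen.

```python
from collections import defaultdict

def cooccurrence_table(tokens, max_distance=5):
    """For every unordered pair of different words seen within max_distance
    tokens of each other, return (co-occurrence count, average distance)."""
    totals = defaultdict(lambda: [0, 0])  # pair -> [count, summed distance]
    for i, left in enumerate(tokens):
        for j in range(i + 1, min(i + max_distance + 1, len(tokens))):
            right = tokens[j]
            if left == right:
                continue
            pair = tuple(sorted((left, right)))
            totals[pair][0] += 1
            totals[pair][1] += j - i
    return {pair: (count, dist / count)
            for pair, (count, dist) in totals.items()}

def related_terms(word, table, top_n=3):
    """Rank partners of `word`: most frequent first, then closest on average."""
    scored = []
    for (a, b), (count, avg) in table.items():
        if word in (a, b):
            other = b if a == word else a
            scored.append((-count, avg, other))
    return [other for _, _, other in sorted(scored)[:top_n]]

tokens = "apple iphone banana apple iphone".split()
table = cooccurrence_table(tokens, max_distance=2)
print(related_terms("apple", table))  # ['iphone', 'banana']
```

The table grows with the square of the vocabulary in the worst case, which is why limiting the distance (and possibly the set of words) matters.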

answered Nov 13 '22 10:11

sinelaw

What I would do is get all the words in a text and make a frequency list (how often each word appears). Maybe also add a heuristic factor for how far each word is from "Apple". Then read multiple documents, and cross out words that are not common to all the documents. Then prioritize based on frequency and distance from the keyword. Of course, you will get a lot of garbage and possibly miss some relevant words, but by adjusting the heuristics you should get at least some decent matches.
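
One possible reading of this heuristic, as a small sketch: every word near the keyword gets 1/distance added to its score per occurrence (so frequency and closeness both count), and only words that show up near the keyword in every document are kept. The `window` parameter, the 1/distance weighting, and the function names are assumptions of this sketch, and they are exactly the kind of heuristics you would tune.

```python
from collections import Counter

def score_document(tokens, keyword, window=5):
    """Score words near `keyword`; closer occurrences add more weight."""
    scores = Counter()
    positions = [i for i, t in enumerate(tokens) if t == keyword]
    for pos in positions:
        lo = max(0, pos - window)
        hi = min(len(tokens), pos + window + 1)
        for i in range(lo, hi):
            if tokens[i] != keyword:
                scores[tokens[i]] += 1.0 / abs(i - pos)
    return scores

def related_terms(docs, keyword, window=5, top_n=3):
    """Keep only words scored in every document, ranked by total score."""
    per_doc = [score_document(d.lower().split(), keyword, window) for d in docs]
    if not per_doc:
        return []
    common = set(per_doc[0])
    for scores in per_doc[1:]:
        common &= set(scores)          # cross out words missing from any doc
    combined = Counter()
    for scores in per_doc:
        for word in common:
            combined[word] += scores[word]
    ranked = sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_n]]

docs = [
    "apple iphone is great",
    "the apple iphone sold well",
    "new iphone from apple today",
]
print(related_terms(docs, "apple"))  # ['iphone']
```

Requiring a word to appear in every document is strict: with many documents it will also cross out genuinely related terms, so a softer threshold (say, "appears in most documents") may work better in practice.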

answered Nov 13 '22 10:11

Radu

Tags:

artificial-intelligence

similarity