Algorithm to find related words in a text

I would like to take a word (e.g. "Apple") and process a text (or maybe several). I'd like to come up with related terms. For example: process a document for "Apple" and find that iPod, iPhone, and Mac are terms related to "Apple".

Any idea on how to solve this?

asked Sep 25 '11 by Andrew



3 Answers

As a starting point: your question relates to text mining.

There are two approaches: a statistical one, and one from natural language processing (NLP).

I do not know much about NLP, but I can say something about the statistical approach:

  1. You need some vector space representation of your documents (a sketch combining this with step 3 follows the list):
     http://en.wikipedia.org/wiki/Vector_space_model
     http://en.wikipedia.org/wiki/Document-term_matrix
     http://en.wikipedia.org/wiki/Tf%E2%80%93idf

  2. In order to learn semantics (that is, that different words can mean the same thing, or that one word can have several meanings), you need a large text corpus for learning. As I said, this is a statistical approach, so you need lots of samples. http://www.daviddlewis.com/resources/testcollections/

    Maybe you have lots of documents from the context you are going to use. That is the best situation.

  3. You have to extract latent factors from this corpus. The most common methods are:

    • LSA (http://en.wikipedia.org/wiki/Latent_semantic_analysis)
    • PLSA (http://en.wikipedia.org/wiki/Probabilistic_latent_semantic_analysis)
    • non-negative matrix factorization (http://en.wikipedia.org/wiki/Non-negative_matrix_factorization)
    • latent Dirichlet allocation (http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)

    These methods involve lots of math. Either you dig into it, or you have to find good libraries.
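
To make steps 1 and 3 concrete, here is a minimal sketch in Python using scikit-learn (my choice of library; the tiny `docs` corpus is invented for illustration). It builds a tf-idf document-term matrix, extracts latent factors with truncated SVD (i.e. LSA), and ranks terms by cosine similarity to "apple":

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus: a stand-in for whatever documents you actually have.
docs = [
    "Apple released a new iPhone and updated the iPod line",
    "The new Mac ships with an updated operating system from Apple",
    "Apple quarterly results were driven by iPhone and Mac sales",
    "Bananas and apples are rich in fiber",
]

# Step 1: vector space model with tf-idf weighting.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)        # document-term matrix

# Step 3: latent factors via truncated SVD, i.e. LSA.
svd = TruncatedSVD(n_components=2, random_state=0)
svd.fit(X)
term_vectors = svd.components_.T          # one latent vector per term

# Related terms are the nearest neighbors of "apple" in the latent space.
terms = list(vectorizer.get_feature_names_out())
idx = terms.index("apple")
sims = cosine_similarity(term_vectors[idx:idx + 1], term_vectors)[0]
for i in sims.argsort()[::-1][:6]:
    print(terms[i], round(float(sims[i]), 2))
```

With only four toy documents and two latent components the ranking is merely indicative; on a real corpus you would use far more documents and components.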

I can recommend the following books:

  • http://www.oreilly.de/catalog/9780596529321/toc.html
  • http://www.oreilly.de/catalog/9780596516499/index.html
answered Nov 13 '22 by rocksportrocker


Like all of AI, it's a very difficult problem. You should look into natural language processing to learn about some of the issues.

One very, very simplistic approach is to build a 2D table of words, storing for each pair of words the average distance (in words) at which they appear in the text. Obviously you'll need to limit the maximum distance considered, and possibly the number of words as well. Then, after processing a lot of text, you'll have an indicator of how often certain words appear in the same context.
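
A hedged sketch of that idea in Python (the window size and the toy sentence are my assumptions, not the answer's):

```python
from collections import defaultdict

def average_distances(text, max_distance=5):
    """For each word pair within max_distance words of each other,
    track the total distance and how often the pair co-occurs."""
    words = text.lower().split()
    totals = defaultdict(lambda: [0, 0])  # pair -> [distance sum, count]
    for i in range(len(words)):
        for j in range(i + 1, min(i + 1 + max_distance, len(words))):
            pair = tuple(sorted((words[i], words[j])))
            totals[pair][0] += j - i
            totals[pair][1] += 1
    # average distance and co-occurrence count per pair
    return {pair: (s / n, n) for pair, (s, n) in totals.items()}

text = "apple announced the new iphone while apple also updated the ipod"
pairs = average_distances(text)
for pair, (avg, n) in sorted(pairs.items(), key=lambda kv: -kv[1][1])[:5]:
    print(pair, "avg distance %.1f, seen %dx" % (avg, n))
```

Pairs that co-occur often at short average distances are candidates for "same context"; after processing many texts, words that keep appearing near "apple" would bubble up.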

answered Nov 13 '22 by sinelaw


What I would do is get all the words in a text and build a frequency list (how often each word appears). Maybe also add a heuristic factor based on how far each word is from "Apple". Then read multiple documents, and cross out words that are not common to all of them. Then prioritize based on frequency and on distance from the keyword. Of course, you will get a lot of garbage and possibly miss some relevant words, but by adjusting the heuristics you should get at least some decent matches.
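
A rough sketch of this heuristic in Python, under an assumed scoring of frequency / (1 + average distance to the keyword); the exact weighting and the toy documents are mine, not the answer's:

```python
from collections import Counter

def related_terms(docs, keyword, top=5):
    scores = Counter()   # candidate word -> accumulated score
    seen_in = Counter()  # candidate word -> number of docs it appears in
    for doc in docs:
        words = doc.lower().split()
        positions = [i for i, w in enumerate(words) if w == keyword]
        if not positions:
            continue
        for w in set(words) - {keyword}:
            occurrences = [i for i, x in enumerate(words) if x == w]
            avg_dist = sum(min(abs(i - p) for p in positions)
                           for i in occurrences) / len(occurrences)
            # frequent words close to the keyword score higher
            scores[w] += len(occurrences) / (1 + avg_dist)
            seen_in[w] += 1
    # keep only words common to all documents containing the keyword
    n_docs = sum(1 for d in docs if keyword in d.lower().split())
    return [(w, round(s, 2)) for w, s in scores.most_common()
            if seen_in[w] == n_docs][:top]

docs = [
    "apple launched the iphone and the ipod today",
    "the iphone is the flagship product of apple",
]
print(related_terms(docs, "apple"))
```

As the answer warns, common words like "the" dominate the output; in practice you would filter stop words or adjust the weighting.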

answered Nov 13 '22 by Radu