
Prioritizing text based on content

If you have a list of texts and a person interested in certain topics, which algorithms deal with choosing the most relevant text for that person?

I believe this is quite a complex topic, so as an answer I expect a few directions for studying the various methodologies of text analysis, text statistics, artificial intelligence, etc.

Thank you.

xralf asked Feb 16 '26 23:02


1 Answer

There are quite a few algorithms for this task, far too many to mention them all here. First, some starting points:

  • Topic discovery and recommendation are two quite distinct tasks, although they often overlap. If you have a stable user base, you might be able to give very good recommendations without any topic discovery, based only on what similar users liked.

  • Discovering topics and assigning names to them are also two different tasks. It is often easier to tell that text A and text B share a similar topic than to state explicitly what this common topic is. Giving names to the topics is best done by humans, for example by having them tag the items.
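To illustrate the first point, here is a minimal sketch of a recommendation that never looks at the text content. It assumes a user-based collaborative filtering approach with Jaccard similarity, which is only one of many possible choices; the `likes` data, the user names, and the `recommend` helper are all hypothetical.

```python
def jaccard(a, b):
    """Overlap between two sets of liked items, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(likes, user, top_n=3):
    """Score items the user has not seen by the similarity of users who liked them."""
    seen = likes[user]
    scores = {}
    for other, items in likes.items():
        if other == user:
            continue
        sim = jaccard(seen, items)
        if sim == 0.0:
            continue  # users with nothing in common contribute nothing
        for item in items - seen:
            scores[item] = scores.get(item, 0.0) + sim
    # Highest score first; ties are broken by item name for a stable order.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:top_n]]


# Hypothetical data: which texts each user liked.
likes = {
    "alice": {"text1", "text2", "text3"},
    "bob": {"text1", "text2", "text4"},
    "carol": {"text5", "text6"},
    "dave": {"text2", "text3", "text4", "text7"},
}
recommendations = recommend(likes, "alice")
print(recommendations)
```

No topics are discovered or named here: the texts are treated as opaque identifiers, and the ranking comes purely from the overlap in user behaviour.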

Now to some actual examples.

  • TF-IDF is often a good starting point, but it also has severe drawbacks. For example, it treats "car" and "truck" as unrelated words, so it cannot tell that two texts containing them probably share a topic.

  • WEBSOM (http://websom.hut.fi/websom/): a Kohonen map (self-organizing map) for automatically clustering documents. It learns the topics and then organizes the texts by topic.

  • Latent Semantic Analysis (http://de.wikipedia.org/wiki/Latent_Semantic_Analysis): can improve on TF-IDF by detecting semantic similarity among different words. Also note that this technique has been patented (the original US patent dates from the late 1980s), so check whether the patent still applies before you use it.

  • Once you have a set of topics assigned by users or experts, you can also try almost any kind of machine learning method (for example, an SVM) to map the TF-IDF data to topics.
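As a rough sketch of the TF-IDF starting point, the following ranks texts against a person's interests using only the Python standard library. The tokenizer, the smoothed IDF formula (`log(n / df) + 1`), the cosine similarity scoring, and the sample texts are all assumptions made for illustration; real libraries differ in these details.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters (a crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(texts):
    """Return one {term: weight} dict per text, plus the IDF table."""
    docs = [Counter(tokenize(t)) for t in texts]
    n = len(docs)
    # Document frequency: in how many texts each term occurs.
    df = Counter(term for doc in docs for term in doc)
    idf = {term: math.log(n / count) + 1.0 for term, count in df.items()}
    vectors = [{term: tf * idf[term] for term, tf in doc.items()} for doc in docs]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_texts(texts, interests):
    """Sort texts by similarity to a free-text description of interests."""
    vectors, idf = tfidf_vectors(texts)
    profile_tf = Counter(tokenize(interests))
    # Terms that occur in no text cannot match anything, so they are dropped.
    profile = {t: tf * idf[t] for t, tf in profile_tf.items() if t in idf}
    scored = [(cosine(profile, vec), text) for vec, text in zip(vectors, texts)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical sample data.
texts = [
    "The new car has a very efficient engine",
    "A truck delivers fresh bread every morning",
    "Stock markets fell sharply on Monday",
]
ranking = rank_texts(texts, "car engine")
for score, text in ranking:
    print(round(score, 3), text)
```

The text about the truck gets the same score of zero as the text about stock markets, because it shares no literal word with the interests. This is the drawback described above, and it is the gap that methods such as Latent Semantic Analysis try to close.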

LiKao answered Feb 18 '26 15:02