Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Algorithm to Detect and Compare Phrases

I have a couple of non-English texts. I would like to perform stylistic comparisons on them.

One method of comparing style is to look for similar phrases. If I find in one book "fishing, skiing and hiking" a couple of times and in another book "fishing, hiking and skiing" the similarity in style points to one author. I need to also be able to find "fishing and even skiing or hiking" though. Ideally I would also find "angling, hiking and skiing" but because they are non-English texts (Koine Greek), synonyms are harder to allow for and this aspect is not vital.

What is the best way to (1) go about detecting these sorts of phrases and then (2) searching for them in a way that is not overly rigid in other texts (so as to find "fishing and even skiing or hiking")?

like image 237
jcuenod Avatar asked Jun 30 '11 11:06

jcuenod


3 Answers

  • Take all your texts, and build a list of the words. Easy way : take all the words. Hard way : take only the relevant one (i.e : in English, "the" is never a pertinent word as it it used too often). Let's say you have V words in your vocabulary.
  • For each text, build an adjacency matrix A, which size is V*V. The row A(i) states how close the words in your vocabulary are to the i-th word V(i). For example, if V(i)="skiing", then A(i,j) is how close the word V(j) is to the word "skiing". You'd prefer a small vocabulary!

Technical details : For the vocabulary, you have several possibilities to get a good vocabulary. Unfortunately, I can't remember the names. One of them consists of deleting words that are present often and everywhere. On the contrary, you should keep rare words that are present in few texts. However, there is no use in conserving words present exactly in one text.

For the adjacency matrix, the adjacency is measured is done by counting how far the words you are considering are (couting the number of words separating them). For example, let's use your very text =)

One method of comparing style is to look for similar phrases. If I find in one book "fishing, skiing and hiking" a couple of times and in another book "fishing, hiking and skiing" the similarity in style points to one author. I need to also be able to find "fishing and even skiing or hiking" though. Ideally I would also find "angling, hiking and skiing" but because they are non-English texts (Koine Greek), synonyms are harder to allow for and this aspect is not vital.

These are entirely made up values :
A(method, comparing) += 1.0
A(method, similarity) += 0.5
A(method, Greek) += 0.0

You mainly need a "typical distance". You can say for example that after 20 separation-words, then the words can't be considered adjacent anymore.

After a bit of normalization, just make a L2 distance between the adjacency matrix of two texts to see how close they are. You can do fancier stuff afterwards, but this should yield acceptable results. Now, if you have synonyms, you can update the adjacency in a nice way. For example, if you have in input "beautiful maiden", then
A(beautiful, maiden) += 1.0
A(magnificent, maiden) += 0.9
A(fair, maiden) += 0.8
A(sublime, maiden) += 0.8
...

like image 176
B. Decoster Avatar answered Nov 15 '22 18:11

B. Decoster


You should probably use some string similarity measure such as Jaccard, Dice or cosine similarity. You could try these either on words, on (word or character-level) n-grams or on lemmas. (For a highly inflected language such as Koinè Greek, I would suggest using lemmas if you have a good lemmatizer for it.)

Catching synonyms is hard unless you have something like WordNet, which maps synonyms together.

like image 1
Fred Foo Avatar answered Nov 15 '22 17:11

Fred Foo


I would follow two guidelines:

  • Beware of premature optimisation in the matching algorithm. Start from a broad approach and then refinine it as per need (i.e. check if a simple "proximity" test gives good-enough results for dataset you know the answer for, and if not, tweak it until it does). In many cases you will find out that a highly optimised solution won't give results considerably different than your first rough attempt.
  • Use some sort of self-learning algorithm. This way you could feed the AI a number of texts that can make it smarter. Taking inspiration from your example: before trying to compare two target texts, I would feed a text on outdoor life. This way the AI would most probably learn by itself that angling is a very close match for fishing.

As a self-learning AI, I would use (at least for a start) a neural network. There is an easy and fully working example (in python) that can be found here and that targets precisely "data mining". You might wish to implement in some other language, of course.

About your two specific questions:

What is the best way to go about detecting these sorts of phrases

Other answers to your question have gone in details about this (and their authors seem to know way more than I do on the subject!), but again: I would start easy and simply use a neural network that tells you how close two terms are. Then I would proceed with "waves" of optimisation (for example - if it was an English text - using only the root of the word, or maybe it is of some use to tweak the score according to some other metadata of the text like year, or author, or geographical origin, or yet changing the matching algorithm altogether...) until you are satisfied with the outcome.

What is the best way to go for searching for them in a way that is not overly rigid in other texts (so as to find "fishing and even skiing or hiking"

I would say this is equivalent to ask the AI to return all phrases whose "proximity score" is over a given threshold.

HTH!

like image 1
mac Avatar answered Nov 15 '22 18:11

mac