I have a couple of non-English texts. I would like to perform stylistic comparisons on them.
One method of comparing style is to look for similar phrases. If I find in one book "fishing, skiing and hiking" a couple of times and in another book "fishing, hiking and skiing" the similarity in style points to one author. I need to also be able to find "fishing and even skiing or hiking" though. Ideally I would also find "angling, hiking and skiing" but because they are non-English texts (Koine Greek), synonyms are harder to allow for and this aspect is not vital.
What is the best way to (1) go about detecting these sorts of phrases and then (2) searching for them in a way that is not overly rigid in other texts (so as to find "fishing and even skiing or hiking")?
Technical details : For the vocabulary, you have several possibilities to get a good vocabulary. Unfortunately, I can't remember the names. One of them consists of deleting words that are present often and everywhere. On the contrary, you should keep rare words that are present in few texts. However, there is no use in conserving words present exactly in one text.
For the adjacency matrix, the adjacency is measured is done by counting how far the words you are considering are (couting the number of words separating them). For example, let's use your very text =)
One method of comparing style is to look for similar phrases. If I find in one book "fishing, skiing and hiking" a couple of times and in another book "fishing, hiking and skiing" the similarity in style points to one author. I need to also be able to find "fishing and even skiing or hiking" though. Ideally I would also find "angling, hiking and skiing" but because they are non-English texts (Koine Greek), synonyms are harder to allow for and this aspect is not vital.
These are entirely made up values :
A(method, comparing) += 1.0
A(method, similarity) += 0.5
A(method, Greek) += 0.0
You mainly need a "typical distance". You can say for example that after 20 separation-words, then the words can't be considered adjacent anymore.
After a bit of normalization, just make a L2 distance between the adjacency matrix of two texts to see how close they are. You can do fancier stuff afterwards, but this should yield acceptable results. Now, if you have synonyms, you can update the adjacency in a nice way. For example, if you have in input "beautiful maiden", then
A(beautiful, maiden) += 1.0
A(magnificent, maiden) += 0.9
A(fair, maiden) += 0.8
A(sublime, maiden) += 0.8
...
You should probably use some string similarity measure such as Jaccard, Dice or cosine similarity. You could try these either on words, on (word or character-level) n-grams or on lemmas. (For a highly inflected language such as Koinè Greek, I would suggest using lemmas if you have a good lemmatizer for it.)
Catching synonyms is hard unless you have something like WordNet, which maps synonyms together.
I would follow two guidelines:
angling
is a very close match for fishing
.As a self-learning AI, I would use (at least for a start) a neural network. There is an easy and fully working example (in python) that can be found here and that targets precisely "data mining". You might wish to implement in some other language, of course.
About your two specific questions:
What is the best way to go about detecting these sorts of phrases
Other answers to your question have gone in details about this (and their authors seem to know way more than I do on the subject!), but again: I would start easy and simply use a neural network that tells you how close two terms are. Then I would proceed with "waves" of optimisation (for example - if it was an English text - using only the root of the word, or maybe it is of some use to tweak the score according to some other metadata of the text like year, or author, or geographical origin, or yet changing the matching algorithm altogether...) until you are satisfied with the outcome.
What is the best way to go for searching for them in a way that is not overly rigid in other texts (so as to find "fishing and even skiing or hiking"
I would say this is equivalent to ask the AI to return all phrases whose "proximity score" is over a given threshold.
HTH!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With