I need to implement algorithm (or find one in an open source library) for evaluation of text similarities. I need an efficient algorithm for given two arbitrary sets of documents (relatively small number of big chunks of text) it to create a matching pairs between them - which document is most likely to be produced from which one.
I believe I will split this in two - defining the similarity coefficient of every pair - and then applying some of the assignment problem algorithms. While for the assignment algorithms I can find a good number of solutions I cannot find a good one for the computing the similarity coefficients.
Note the documents are not known in advance - computing indexes of the text (if there is) must be fast as well.
I am aware of Hamming distance, Levenshtein distance some of the other algorithms for string difference. This is not what I am looking for though - I am using the word text instead string on purpose.
I am not looking for phrase search algorithms as well what libraries like Lucene and Xapian are made for (at least seems to be).
Probably something based on tf–idf.
I guess the question is, is there something that already solves this problem or is it possible libraries like lucete to be used to do that.
Here is what I would do as a starting point (just because it is simple and fast):
We can assume that the dictionary size is < 1m (or 21bit), so we can just encode a trigram in an int64.
void CountTrigrams(const vector<string>& words,
map<string, int> * dict,
map<int64, int> * result) {
int64 trigram = 0;
for (int i = 0; i < words.size(); i++) {
const& word = words[i];
int id;
auto di = dict->find(word);
if (di == dict->end()) {
id = dict.size();
dict[word] = id;
} else {
id = di->second;
}
trigram = ((trigram << 21) | id) & 0x7fffffffffffffff;
if (i > 2) {
auto ti = result->find(trigram);
if (ti == result->end()) {
result[trigram] = 1;
} else {
ti->second++;
}
}
}
}
Then compare the results for each pair:
int Compare(const map<int64, int> & t1, const map<int64, int> & t2) {
int score = 0;
for (auto i = t1.first(); i != t1.end(); i++) {
auto j = t2.find(t1->first);
if (j != t2.end()) {
score += MAX(i->second, j->second);
}
}
return score;
}
It may make sense to normalize the score somehow, e.g. divide by the total number of trigrams.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With