Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

algorithm/library for text similarity [closed]

I need to implement algorithm (or find one in an open source library) for evaluation of text similarities. I need an efficient algorithm for given two arbitrary sets of documents (relatively small number of big chunks of text) it to create a matching pairs between them - which document is most likely to be produced from which one.

I believe I will split this in two - defining the similarity coefficient of every pair - and then applying some of the assignment problem algorithms. While for the assignment algorithms I can find a good number of solutions I cannot find a good one for the computing the similarity coefficients.

Note the documents are not known in advance - computing indexes of the text (if there is) must be fast as well.

I am aware of Hamming distance, Levenshtein distance some of the other algorithms for string difference. This is not what I am looking for though - I am using the word text instead string on purpose.

I am not looking for phrase search algorithms as well what libraries like Lucene and Xapian are made for (at least seems to be).

Probably something based on tf–idf.

I guess the question is, is there something that already solves this problem or is it possible libraries like lucete to be used to do that.

like image 921
gsf Avatar asked Nov 13 '22 05:11

gsf


1 Answers

Here is what I would do as a starting point (just because it is simple and fast):

  • Map the words to numbers using a shared map or hash_map
  • For each text, build the corresponding map of word-level trigram counts
  • Compare the overlap

We can assume that the dictionary size is < 1m (or 21bit), so we can just encode a trigram in an int64.

void CountTrigrams(const vector<string>& words, 
                   map<string, int> * dict, 
                   map<int64, int> * result) {
  int64 trigram = 0;
  for (int i = 0; i < words.size(); i++) {
    const& word = words[i];
    int id;
    auto di = dict->find(word);
    if (di == dict->end()) {
      id = dict.size();
      dict[word] = id;
    } else {
      id = di->second;
    }
    trigram = ((trigram << 21) | id) & 0x7fffffffffffffff;
    if (i > 2) {
      auto ti = result->find(trigram);
      if (ti == result->end()) {
        result[trigram] = 1;
      } else {
        ti->second++;
      }
    }
  }
}

Then compare the results for each pair:

int Compare(const map<int64, int> & t1, const map<int64, int> & t2) {
  int score = 0;
  for (auto i = t1.first(); i != t1.end(); i++) {
    auto j = t2.find(t1->first);
    if (j != t2.end()) {
      score += MAX(i->second, j->second);
    }
  }
  return score;
}

It may make sense to normalize the score somehow, e.g. divide by the total number of trigrams.

like image 129
Stefan Haustein Avatar answered Nov 14 '22 21:11

Stefan Haustein