Dense vector array and cosine similarity

I would like to store an array of dense_vector values in my document, but this does not work the way it does for other data types, e.g.:

PUT my_index
{
  "mappings": {
    "properties": {
      "my_vectors": {
        "type": "dense_vector",
        "dims": 3  
      },
      "my_text" : {
        "type" : "keyword"
      }
    }
  }
}

PUT my_index/_doc/1
{
  "my_text" : "text1",
  "my_vectors" : [[0.5, 10, 6], [-0.5, 10, 10]]
}

returns:

'1 document(s) failed to index.',
    {'_index': 'my_index', '_type': '_doc', '_id': 'some_id', 'status': 400, 'error': 
      {'type': 'mapper_parsing_exception', 'reason': 'failed to parse', 'caused_by': 
        {'type': 'parsing_exception', 
         'reason': 'Failed to parse object: expecting token of type [VALUE_NUMBER] but found [START_ARRAY]'
        }
      }
    }

How do I achieve this? Different documents will have a variable number of vectors but never more than a handful.

Also, I would then like to query it by performing a cosineSimilarity for each value in that array. The code below is how I normally do it when I have only one vector in the doc.

"script_score": {
    "query": {
        "match_all": {}
    },
    "script": {
        "source": "(1.0+cosineSimilarity(params.query_vector, doc['my_vectors']))",
        "params": {"query_vector": query_vector}
    }
}

Ideally, I would like the closest similarity or an average.

Leo asked Jan 26 '23 03:01


1 Answer

The dense_vector data type expects a single array of numeric values per document, like so:

PUT my_index/_doc/1
{
  "my_text" : "text1",
  "my_vectors" : [0.5, 10, 6]
}

To store any number of vectors, you could make the my_vectors field a "nested" type that contains an array of objects, where each object holds one vector:

PUT my_index
{
  "mappings": {
    "properties": {
      "my_vectors": {
        "type": "nested",
        "properties": {
          "vector": {
            "type": "dense_vector",
            "dims": 3  
          }
        }
      },
      "my_text" : {
        "type" : "keyword"
      }
    }
  }
}

PUT my_index/_doc/1
{
  "my_text" : "text1",
  "my_vectors" : [
    {"vector": [0.5, 10, 6]},
    {"vector": [-0.5, 10, 10]}
  ]
}
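
If the documents are built in application code, the flat list of vectors has to be wrapped into objects before indexing. A minimal Python sketch, assuming the nested mapping above (field names my_text, my_vectors and vector, 3 dimensions); the helper name to_nested_doc is hypothetical and not part of any Elasticsearch client:

```python
def to_nested_doc(text, vectors, dims=3):
    """Build a document body for the nested mapping above.

    Each vector becomes its own {"vector": [...]} object under "my_vectors".
    `dims` is assumed to match the "dims" value in the mapping.
    """
    for v in vectors:
        if len(v) != dims:
            raise ValueError(f"expected {dims} dimensions, got {len(v)}")
    return {
        "my_text": text,
        "my_vectors": [{"vector": list(v)} for v in vectors],
    }


# The two vectors from the question, wrapped for indexing.
doc = to_nested_doc("text1", [[0.5, 10, 6], [-0.5, 10, 10]])
```

The resulting dict can be passed as the document body to whichever client you use; checking the dimensions up front is only a convenience to catch a mismatch before the request is sent.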

EDIT

Then, to query the documents, you can use the following (as of ES v7.6.1):

{
  "query": {
    "nested": {
      "path": "my_vectors",
      "score_mode": "max", 
      "query": {
        "function_score": {
          "script_score": {
            "script": {
              "source": "(1.0+cosineSimilarity(params.query_vector, 'my_vectors.vector'))",
              "params": {"query_vector": query_vector}
            }
          }
        }
      }
    }
  }
}
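
In the body above, query_vector is a placeholder for a client-side variable. As a rough sketch of how that body could be assembled in Python as a plain dict (the function name nested_cosine_query and its parameters are hypothetical; sending the request is left to your client of choice):

```python
# score_mode values accepted by the nested query.
VALID_SCORE_MODES = {"avg", "max", "min", "none", "sum"}


def nested_cosine_query(query_vector, score_mode="max", with_inner_hits=False):
    """Build the nested cosine-similarity query shown above as a dict."""
    if score_mode not in VALID_SCORE_MODES:
        raise ValueError(f"unsupported score_mode: {score_mode!r}")
    nested = {
        "path": "my_vectors",
        "score_mode": score_mode,
        "query": {
            "function_score": {
                "script_score": {
                    "script": {
                        "source": "(1.0+cosineSimilarity(params.query_vector, 'my_vectors.vector'))",
                        "params": {"query_vector": list(query_vector)},
                    }
                }
            }
        },
    }
    if with_inner_hits:
        # An empty inner_hits object asks for the matching nested documents
        # to be returned alongside each parent hit.
        nested["inner_hits"] = {}
    return {"query": {"nested": nested}}


body = nested_cosine_query([0.5, 10, 6], score_mode="avg", with_inner_hits=True)
```

Switching score_mode between "max" and "avg" covers the two options asked about in the question (closest similarity or average).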

A few things to note:

  • The query needs to be wrapped in a nested query, because the vectors are stored in nested objects.
  • Because nested objects are indexed as separate Lucene documents, each one is scored individually. By default, the parent document is assigned the average score of its matching nested documents. You can set the nested query's score_mode property to change this. For your case, "max" scores the parent by its largest cosine similarity, i.e. by its most similar vector.
  • If you're interested in seeing the score of each nested vector, you can use the nested query's inner_hits property.
  • In case anyone is curious why 1.0 is added to the cosine similarity: cosine similarity takes values in [-1, 1], but Elasticsearch does not allow negative scores, so the scores are shifted to [0, 2].
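
To make the scoring notes concrete, here is a small pure-Python sketch of the arithmetic they describe: each stored vector gets 1.0 plus its cosine similarity to the query, and the parent score is the maximum or the average of those values. It only illustrates the math and is not how Elasticsearch computes scores internally; the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def parent_score(query_vector, vectors, score_mode="max"):
    """Combine the shifted per-vector scores the way "max" or "avg" would."""
    scores = [1.0 + cosine_similarity(query_vector, v) for v in vectors]
    if score_mode == "max":
        return max(scores)
    if score_mode == "avg":
        return sum(scores) / len(scores)
    raise ValueError(f"unsupported score_mode: {score_mode!r}")


stored = [[0.5, 10, 6], [-0.5, 10, 10]]
query = [0.5, 10, 6]

# The first stored vector equals the query, so its shifted score is
# 2.0 up to floating-point rounding, and "max" picks that value.
best = parent_score(query, stored, "max")
average = parent_score(query, stored, "avg")
```

Vectors pointing in exactly opposite directions have a cosine similarity of -1 and therefore a shifted score of 0, the bottom of the [0, 2] range.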
Glen Smith answered Jan 31 '23 08:01
