Measuring precision and recall when the raw data is missing information

Question

I am trying to improve my chat app:

Using previous (pre-processed) chat interactions from my domain, I have built a tool that offers the user 5 possible utterances for a given chat context, for example:

Raw: "Hi John."

Context: hi [[USER_NAME]]
Utterances: [Hi, Hello, How are you, Hi there, Hello again]

Of course, the results are not always relevant, for example:

Raw: "Hi John. How are you? I am fine, are you in the office?"

Context: hi [[USER_NAME]] how are you i am fine are you in the office
Utterances: [Yes, No, Hi, Yes i am, How are you]

I am using Elasticsearch with the TF/IDF similarity model and an index structured like so:

{
  "_index": "engagements",
  "_type": "context",
  "_id": "48",
  "_score": 1,
  "_source": {
    "context": "hi [[USER_NAME]] how are you i am fine are you in the office",
    "utterance": "Yes I am"
  }
}

Problem: I know for sure that for the context "hi [[USER_NAME]] how are you i am fine are you in the office" the utterance "Yes I am" is relevant. However, "Yes" and "No" are relevant too, because they appeared in a similar context.

I am trying to use this excellent video as a starting point.

Q: How can I measure precision and recall, if all I know (from my raw data) is just one true utterance?

sophros · Accepted Answer

I think the main question is whether any of the acceptable answers is better than the others (that is, whether there is an order of relevance). If not, then any answer from the list of acceptable ones counts as a TP. If there is an order of relevance, you can incorporate it as a degree of TP and a degree of FP:

answers: A < B < C < D

D - best; A - worst but still acceptable

assigned contributions to TP:

A - 0.5 + 1/4*(1-0.5) = 0.625

D - TP: 1.0; FP: 0.0

A - TP: 0.625; FP: 1-0.625 = 0.375

In such a case, any answer that is not the best is partially wrong. Since it is still in the correct set, though, its contribution to TP should not be smaller than 0.5 (its complement contributes to FP, and even a borderline correct answer should not be seen as more "bad" than "good").
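The two values given above (0.625 for A and 1.0 for D) are consistent with one linear rule. As a sketch, assume a floor f = 0.5, n = 4 acceptable answers, and a rank r that runs from 1 (worst, A) to n (best, D). The symbols w, f, r and n are introduced here for illustration, and the values for B and C are derived from that assumed rule rather than stated in the answer:

```latex
\begin{align*}
w(r) &= f + \frac{r}{n}\,(1 - f), \qquad f = 0.5,\ n = 4 \\
w(A) &= 0.5 + \tfrac{1}{4}\cdot 0.5 = 0.625 \\
w(B) &= 0.5 + \tfrac{2}{4}\cdot 0.5 = 0.75 \\
w(C) &= 0.5 + \tfrac{3}{4}\cdot 0.5 = 0.875 \\
w(D) &= 0.5 + \tfrac{4}{4}\cdot 0.5 = 1.0 \\
\text{TP contribution} &= w(r), \qquad \text{FP contribution} = 1 - w(r)
\end{align*}
```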

As you can see above, I am treating the order penalty linearly. You can of course use any penalty function you deem appropriate, for example if the first answer is much better than the rest.
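A minimal Python sketch of this scoring follows. It assumes the acceptable answers are supplied ordered from worst to best, and that a suggestion outside that list counts as a full FP. The function names (`linear_credit`, `binary_credit`, `graded_counts`, `precision`) and the sample suggestion "X" are hypothetical, chosen only for illustration. The `credit` argument is where a different penalty function could be plugged in; `binary_credit` covers the case with no order of relevance, where every acceptable answer is a full TP.

```python
def linear_credit(rank, n, floor=0.5):
    """TP credit for an acceptable answer; rank 1 is the worst, rank n the best."""
    return floor + (rank / n) * (1.0 - floor)


def binary_credit(rank, n):
    """No order of relevance: every acceptable answer is a full TP."""
    return 1.0


def graded_counts(suggested, ranked_acceptable, credit=linear_credit):
    """Return fractional (tp, fp) totals for a list of suggested answers.

    ranked_acceptable is ordered from worst to best (an assumption of this
    sketch). A suggestion that is not in it adds 1.0 to FP.
    """
    n = len(ranked_acceptable)
    rank_of = {answer: i + 1 for i, answer in enumerate(ranked_acceptable)}
    tp = fp = 0.0
    for answer in suggested:
        if answer in rank_of:
            c = credit(rank_of[answer], n)
            tp += c          # the "good" share of this suggestion
            fp += 1.0 - c    # its complement counts against precision
        else:
            fp += 1.0
    return tp, fp


def precision(tp, fp):
    """Standard precision, applied to the fractional counts."""
    return tp / (tp + fp) if tp + fp else 0.0


ranked = ["A", "B", "C", "D"]   # D is best, A is worst but still acceptable
suggested = ["D", "A", "X"]     # "X" is a hypothetical irrelevant suggestion

tp, fp = graded_counts(suggested, ranked)
print(tp, fp, precision(tp, fp))

tp_bin, fp_bin = graded_counts(suggested, ranked, credit=binary_credit)
print(tp_bin, fp_bin, precision(tp_bin, fp_bin))
```

With the linear credit, D contributes 1.0 to TP and A contributes 0.625, matching the numbers above. The sketch does not deduplicate suggestions or normalize casing, so "Yes i am" and "Yes I am" would need to be normalized before scoring.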

Tags:

chat

elasticsearch

classification

tf-idf

precision-recall

Shlomi Schwartz

1 Answers

sophros
