I was wondering if there's a NLP/ML technique for this.
Suppose given a set of sentences,
If i have to assign a probability to each of these sentences, that they have "actually" watched the movie, i would assign it in decreasing order of 1,4,3,2.
Is there a way to do this automatically, using some classifier or rules? Any paper/link would help.
These are common issues in textual entailment. I'll refer you to some papers. While their motivation is for textual entailment, I believe your problem should be easier than that.
Determining Modality and Factuality for Textual Entailment
Learning to recognize features of valid textual entailments
Some of these suggestions should help you decide on some features/keywords to consider when ranking.
Except 1, none of the other statements necessarily imply that the person has watched the movie. They could have bought the tickets for somebody else (3) and might be the person who sells popcorn outside the halls (4). I don't think there is any clever system out there that will read between the lines for each sentence and return an answer that exactly agrees with your intuitions (which might be different from that of other people for the same sentence btw).
If this strangely is the only case that you care about (which is possible if you are explicitly working with movie reviews), then it might be worth your time to come with a large number of heuristics patched together that yields a function that near exactly agrees with your intuitions about this.
Otherwise look for context available in all the other sentences these sentences originate from to find relevant clues. Somebody who has actually watched the movie may comment on how they liked it, express opinions about specific scenes, characters and actors from the movie, etc. So if the text contains a lot of high sentiment sentences and refers to words and phrases from the movie, then the person has probably watched the movie. If a lot of it is in future tense, then maybe not.
If you are working with an specific domain, such as "watched the movie or not", or maybe more generally "attended to an event or not", it's basically a case of the Text Classification task.
The common approach in NLP is to use a large amount of sentences tagged as watched or didn't watch to train a machine learning based classifier. The most commonly used features are the presence/absence of keywords, bigrams (sequences of 2 words) and maybe trigrams (sequences of 3 words).
Since you talked about probability, things may get a little more complex. As adi92 noted, in 3 of your sentences the answer is not clear. A way to represent that in the training data could be that a sentence with 0.3 probability of watched appear 3 times tagged as watched and 7 as didn't watch. Most classifiers can have their output easily turned into probabilities.
Anyway, I believe that the main difficulty would be creating a training dataset for the task.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With