 

How do I evaluate a text summarization tool?

I have written a system that summarizes a long document containing thousands of words. Are there any norms on how such a system should be evaluated in the context of a user survey?

In short, is there a metric for evaluating how much time my tool saves a human reader? I was thinking of using the ratio (time taken to read the original document / time taken to read the summary) as a way of estimating the time saved, but are there better metrics?

Currently, I am asking the user subjective questions about the accuracy of the summary.

Asked by Legend


2 Answers

There is also the very recent BERTScore metric (arXiv'19, ICLR'20, already almost 90 citations) that does not suffer from the well-known issues of ROUGE and BLEU.

Abstract from the paper:

We propose BERTScore, an automatic evaluation metric for text generation. Analogously to common metrics, BERTScore computes a similarity score for each token in the candidate sentence with each token in the reference sentence. However, instead of exact matches, we compute token similarity using contextual embeddings. We evaluate using the outputs of 363 machine translation and image captioning systems. BERTScore correlates better with human judgments and provides stronger model selection performance than existing metrics. Finally, we use an adversarial paraphrase detection task to show that BERTScore is more robust to challenging examples when compared to existing metrics.

  • Paper: https://arxiv.org/pdf/1904.09675.pdf

  • Code: https://github.com/Tiiiger/bert_score

  • Full reference:

    Zhang, Tianyi, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. "Bertscore: Evaluating text generation with bert." arXiv preprint arXiv:1904.09675 (2019).
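
For a quick sense of how it is used in practice, here is a minimal sketch assuming the bert-score pip package from the repository above (the candidate and reference sentences are made up for illustration):

   # pip install bert-score   (package from the repository linked above)
   from bert_score import score

   candidates = ["the cat sat on the mat"]          # system / summary outputs
   references = ["a cat was sitting on the mat"]    # human references

   # P, R, F1 are tensors with one entry per candidate/reference pair
   P, R, F1 = score(candidates, references, lang="en", verbose=True)
   print(f"BERTScore F1: {F1.mean().item():.4f}")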

Answered by Antoine


BLEU

  • BLEU measures precision
  • Short for Bilingual Evaluation Understudy
  • Originally designed for machine translation (hence "bilingual")
  • Checks the words of the machine-generated summary against the human reference summary
  • That is, how many of the words (and/or n-grams) in the machine-generated summary appear in the human reference summaries
  • Underlying idea: the closer a machine translation is to a professional human translation, the better it is (a short BLEU example follows this list)
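
A minimal sketch of computing BLEU with NLTK (assuming nltk is installed; the two sentences are the ones reused in the ROUGE example further down):

   from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

   reference = "police killed the gunman".split()   # human reference summary
   candidate = "police kill the gunman".split()     # machine-generated summary

   # BLEU expects a list of reference token lists; smoothing avoids zero scores
   # when some higher-order n-grams have no overlap in such short sentences
   smooth = SmoothingFunction().method1
   bleu = sentence_bleu([reference], candidate, smoothing_function=smooth)
   print(f"BLEU: {bleu:.4f}")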

ROUGE

  • ROUGE measures recall

  • Short for Recall-Oriented Understudy for Gisting Evaluation

  • That is, how many of the words (and/or n-grams) in the human reference summary appear in the machine-generated summary

  • ROUGE-N measures the overlap of n-grams between the system summary and the reference summaries, where N is the n-gram length. The reference text used in the examples below:

    reference_text = """Artificial intelligence (AI, also machine intelligence, MI) is intelligence demonstrated by machines, in contrast to the natural intelligence (NI) displayed by humans and other animals. In computer science AI research is defined as the study of "intelligent agents": any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals. Colloquially, the term "artificial intelligence" is applied when a machine mimics "cognitive" functions that humans associate with other human minds, such as "learning" and "problem solving". See glossary of artificial intelligence. The scope of AI is disputed: as machines become increasingly capable, tasks considered as requiring "intelligence" are often removed from the definition, a phenomenon known as the AI effect, leading to the quip "AI is whatever hasn't been done yet." For instance, optical character recognition is frequently excluded from "artificial intelligence", having become a routine technology. Capabilities generally classified as AI as of 2017 include successfully understanding human speech, competing at a high level in strategic game systems (such as chess and Go), autonomous cars, intelligent routing in content delivery networks, military simulations, and interpreting complex data, including images and videos. Artificial intelligence was founded as an academic discipline in 1956, and in the years since has experienced several waves of optimism, followed by disappointment and the loss of funding (known as an "AI winter"), followed by new approaches, success and renewed funding. For most of its history, AI research has been divided into subfields that often fail to communicate with each other. These sub-fields are based on technical considerations, such as particular goals (e.g. "robotics" or "machine learning"), the use of particular tools ("logic" or "neural networks"), or deep philosophical differences. Subfields have also been based on social factors (particular institutions or the work of particular researchers). The traditional problems (or goals) of AI research include reasoning, knowledge, planning, learning, natural language processing, perception and the ability to move and manipulate objects. General intelligence is among the field's long-term goals. Approaches include statistical methods, computational intelligence, and traditional symbolic AI. Many tools are used in AI, including versions of search and mathematical optimization, neural networks and methods based on statistics, probability and economics. The AI field draws upon computer science, mathematics, psychology, linguistics, philosophy and many others. The field was founded on the claim that human intelligence "can be so precisely described that a machine can be made to simulate it". This raises philosophical arguments about the nature of the mind and the ethics of creating artificial beings endowed with human-like intelligence, issues which have been explored by myth, fiction and philosophy since antiquity. Some people also consider AI to be a danger to humanity if it progresses unabatedly. Others believe that AI, unlike previous technological revolutions, will create a risk of mass unemployment. 
In the twenty-first century, AI techniques have experienced a resurgence following concurrent advances in computer power, large amounts of data, and theoretical understanding; and AI techniques have become an essential part of the technology industry, helping to solve many challenging problems in computer science."""
    

Abstractive summarization

   # Abstractive summarization with the Hugging Face transformers pipeline
   from transformers import pipeline

   print(len(reference_text.split()))  # word count of the source document
   summarization = pipeline("summarization")
   abstractive_summarization = summarization(reference_text)[0]["summary_text"]

Abstractive Output

   In computer science AI research is defined as the study of "intelligent agents" Colloquially, the term "artificial intelligence" is applied when a machine mimics "cognitive" functions that humans associate with other human minds, such as "learning" and "problem solving" Capabilities generally classified as AI as of 2017 include successfully understanding human speech, competing at a high level in strategic game systems (such as chess and Go)

Extractive summarization

   # Extractive summarization with sumy's LexRank summarizer
   from sumy.parsers.plaintext import PlaintextParser
   from sumy.nlp.tokenizers import Tokenizer
   from sumy.summarizers.lex_rank import LexRankSummarizer

   parser = PlaintextParser.from_string(reference_text, Tokenizer("english"))
   # parser.document.sentences holds the parsed sentences
   summarizer = LexRankSummarizer()
   extractive_summarization = summarizer(parser.document, 2)  # keep 2 sentences
   extractive_summarization = ' '.join(str(s) for s in extractive_summarization)

Extractive Output

Colloquially, the term "artificial intelligence" is often used to describe machines that mimic "cognitive" functions that humans associate with the human mind, such as "learning" and "problem solving".As machines become increasingly capable, tasks considered to require "intelligence" are often removed from the definition of AI, a phenomenon known as the AI effect. Sub-fields have also been based on social factors (particular institutions or the work of particular researchers).The traditional problems (or goals) of AI research include reasoning, knowledge representation, planning, learning, natural language processing, perception and the ability to move and manipulate objects.

Using ROUGE to evaluate the abstractive summary

  # ROUGE scores of the abstractive summary against the source text
  from rouge import Rouge
  r = Rouge()
  r.get_scores(abstractive_summarization, reference_text)

ROUGE output for the abstractive summary

  [{'rouge-1': {'f': 0.22299651364421083,
                'p': 0.9696969696969697,
                'r': 0.12598425196850394},
    'rouge-2': {'f': 0.21328671127225052,
                'p': 0.9384615384615385,
                'r': 0.1203155818540434},
    'rouge-l': {'f': 0.29041095634452996,
                'p': 0.9636363636363636,
                'r': 0.17096774193548386}}]

Using ROUGE to evaluate the extractive summary

  # ROUGE scores of the extractive summary against the source text
  from rouge import Rouge
  r = Rouge()
  r.get_scores(extractive_summarization, reference_text)

ROUGE output for the extractive summary

  [{'rouge-1': {'f': 0.27860696251962963,
                'p': 0.8842105263157894,
                'r': 0.16535433070866143},
    'rouge-2': {'f': 0.22296172781038814,
                'p': 0.7127659574468085,
                'r': 0.13214990138067062},
    'rouge-l': {'f': 0.354755780824869,
                'p': 0.8734177215189873,
                'r': 0.22258064516129034}}]

Interpreting ROUGE scores

ROUGE is a score of overlapping words. ROUGE-N refers to overlapping n-grams. Specifically:

ROUGE formula (ROUGE-N):

    ROUGE-N = ( ∑r ∑s Count_match(gram_n) ) / ( ∑r ∑s Count(gram_n) )

where ∑r ranges over the reference summaries, ∑s ranges over the n-grams in one reference summary, Count_match(gram_n) is the number of times that n-gram also appears in the candidate summary, and Count(gram_n) is the number of times it appears in the reference.

I tried to simplify the notation compared with the original paper. Let's assume we are calculating ROUGE-2, i.e. bigram matches. The numerator's ∑s loops through all bigrams in a single reference summary and counts the number of times a matching bigram is found in the candidate summary (the one produced by the summarization algorithm). If there is more than one reference summary, ∑r repeats the process over all reference summaries.

The denominator simply counts the total number of bigrams in all reference summaries. That is the process for one document-summary pair. You repeat it for all documents and average the scores, which gives you the corpus-level ROUGE-N score. A higher score therefore means that, on average, there is high n-gram overlap between your summaries and the references.
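
A minimal sketch of that computation for a single candidate summary (the helper functions and names are my own, just to make the formula concrete):

   from collections import Counter

   def ngrams(tokens, n):
       return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

   def rouge_n(candidate, references, n=2):
       """ROUGE-N recall: matched reference n-grams / total reference n-grams."""
       cand_counts = Counter(ngrams(candidate.split(), n))
       matches, total = 0, 0
       for ref in references:                      # outer sum over reference summaries
           ref_counts = Counter(ngrams(ref.split(), n))
           for gram, count in ref_counts.items():  # inner sum over reference n-grams
               matches += min(count, cand_counts[gram])   # Count_match(gram_n)
               total += count                             # Count(gram_n)
       return matches / total if total else 0.0

With the example below, rouge_n("police kill the gunman", ["police killed the gunman"], n=2) returns 1/3: one matching bigram ("the gunman") out of three reference bigrams.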

   Example:

   S1. police killed the gunman
   
   S2. police kill the gunman
   
   S3. the gunman kill police

S1 is the reference and S2 and S3 are the candidates. Note that S2 and S3 each have one overlapping bigram with the reference, so they get the same ROUGE-2 score, although S2 is clearly better. The additional ROUGE-L score deals with this, where L stands for Longest Common Subsequence. In S2, the first word and the last two words match the reference, so it scores 3/4, whereas S3 only matches the bigram "the gunman", so it scores 2/4.
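
To sanity-check those numbers with the same rouge package used earlier in this answer (it reports recall under the 'r' key), something like this should do:

   # Re-checking the example with the rouge package used above
   from rouge import Rouge

   reference = "police killed the gunman"
   candidates = {"S2": "police kill the gunman", "S3": "the gunman kill police"}

   r = Rouge()
   for name, cand in candidates.items():
       scores = r.get_scores(cand, reference)[0]
       print(name,
             "rouge-2 recall:", round(scores["rouge-2"]["r"], 3),
             "rouge-l recall:", round(scores["rouge-l"]["r"], 3))

S2 and S3 should tie on ROUGE-2 recall, while S2 should come out ahead on ROUGE-L, matching the reasoning above.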

Answered by thrinadhn