
Count number of sentences using REGEX and ignoring acronyms

Tags: java, regex

I am trying to count the number of sentences in a text using a regex. I came up with a first regex (regex1) that effectively finds every full stop:

([^.!?\s][^.!?]*)

After that, I tried to catch most acronyms with a second regex (regex2):

([A-Z]+[a-z]{0,3}\.)

However, I have two problems:

  1. If the acronym is at the end of a sentence (e.g. "Since 20,000 BC."), regex2 still matches it. This is not intended; I only want to find acronyms inside a sentence.

  2. Assuming problem 1 is solved, I want to merge both regexes so that the final pattern yields only the real number of sentences. As an example, consider the following text from Wikipedia:

The National Aeronautics and Space Administration (NASA) is the United States government agency responsible for the civilian space program as well as aeronautics and aerospace research.

President Dwight D. Eisenhower established the National Aeronautics and Space Administration (NASA) in 1958[5] with a distinctly civilian (rather than military) orientation encouraging peaceful applications in space science. The National Aeronautics and Space Act was passed on July 29, 1958, disestablishing NASA's predecessor, the National Advisory Committee for Aeronautics (NACA). The new agency became operational on October 1, 1958.[6][7]

Since that time, most U.S. space exploration efforts have been led by NASA, including the Apollo moon-landing missions, the Skylab space station, and later the Space Shuttle. Currently, NASA is supporting the International Space Station and is overseeing the development of the Orion Multi-Purpose Crew Vehicle, the Space Launch System and Commercial Crew vehicles. The agency is also responsible for the Launch Services Program (LSP) which provides oversight of launch operations and countdown management for unmanned NASA launches.

NASA science is focused on better understanding Earth through the Earth Observing System,[8] advancing heliophysics through the efforts of the Science Mission Directorate's Heliophysics Research Program,[9] exploring bodies throughout the Solar System with advanced robotic spacecraft missions such as New Horizons,[10] and researching astrophysics topics, such as the Big Bang, through the Great Observatories and associated programs.[11] NASA shares data with various national and international organizations such as from the Greenhouse Gases Observing Satellite.

The above text has 9 sentences.

Regex1: 12 matches (D., U., and S. are considered as "full stops")

Regex2: 3 matches (D., U., and S.)

What I need now is a better regex2 that only finds acronyms inside a sentence, and then a way to "merge" both regexes so that I get all sentences.

If merging both regexes is not possible (for any plausible reason), then only consider problem 1, because at the moment my Java program uses the two regexes separately:

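import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Method of a class; content and numberOfSentences are assumed to be its fields
public void breakIntoSentences()
{
    // Count every run of text delimited by '.', '!' or '?'
    Pattern p = Pattern.compile("([^.!?\\s][^.!?]*)");
    Matcher m = p.matcher(content);

    int allPoints = 0;
    while (m.find())
        allPoints++;

    // Count acronym-like tokens: uppercase letters, up to 3 lowercase letters, then a period
    p = Pattern.compile("([A-Z]+[a-z]{0,3}\\.)");
    m = p.matcher(content);

    int allAcronyms = 0;
    while (m.find())
        allAcronyms++;

    numberOfSentences = allPoints - allAcronyms;
}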

Thank you in advance for the help

Flu asked Nov 10 '22 15:11


1 Answer

Here's a pattern:

.+?(?:(?<![\s.]\p{Lu})[.!?]|$)

  • .+? is there just to match a full sentence. If you simply want a count, you can replace it with a single . (dot).
  • (?<![\s.]\p{Lu}) is a negative lookbehind: the position must not be preceded by an uppercase letter that is itself preceded by whitespace or a period. It sits just before [.!?], which matches the end of a sentence, so the period in initials such as D. or U.S. is not treated as a terminator. This seems to handle the acronyms correctly.
  • $ is there just to force the non-greedy .+? at the start to match until the end of the text just in case the text doesn't end with a period.
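
As a rough illustration of how the pattern above could be used for counting in Java, here is a minimal sketch. The class name SentenceCounter, the method countSentences, and the sample string are assumptions made for this example and are not part of the question's program.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for this sketch only.
final class SentenceCounter {
    // Pattern from the answer: lazily consume text up to a terminator that is
    // not part of an initial such as "D." or "U.S.", or up to the end of input.
    private static final Pattern SENTENCE =
            Pattern.compile(".+?(?:(?<![\\s.]\\p{Lu})[.!?]|$)");

    // Each successful find() corresponds to one sentence-like match.
    static int countSentences(String text) {
        Matcher m = SENTENCE.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Sample assembled from fragments of the question's text (an assumption).
        String sample = "President Dwight D. Eisenhower established NASA in 1958. "
                + "Since that time, most U.S. space exploration efforts have been led by NASA. "
                + "The new agency became operational on October 1, 1958.";
        System.out.println(countSentences(sample)); // prints 3
    }
}
```

Note that in Java the . metacharacter does not match line terminators by default, so a single match never spans a line break; multi-paragraph input may therefore behave differently from this single-line sample.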

This regex handles [6][7] as part of the next sentence. If that's not acceptable, you could tweak the pattern a bit by adding [\d\[\]]* just after [.!?].
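
To make the difference concrete, the following sketch compares the original pattern with the tweaked one on a short sample. The class name CitationAwareSplitter, the split helper, and the sample string are hypothetical and introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class, named for this sketch only.
final class CitationAwareSplitter {
    // Pattern as given in the answer.
    static final Pattern ORIGINAL =
            Pattern.compile(".+?(?:(?<![\\s.]\\p{Lu})[.!?]|$)");
    // Same pattern with [\d\[\]]* added right after [.!?], so that trailing
    // citation markers such as [6][7] stay with the sentence they follow.
    static final Pattern TWEAKED =
            Pattern.compile(".+?(?:(?<![\\s.]\\p{Lu})[.!?][\\d\\[\\]]*|$)");

    // Collect every match, trimming the whitespace between sentences.
    static List<String> split(String text, Pattern pattern) {
        List<String> sentences = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            sentences.add(m.group().trim());
        }
        return sentences;
    }

    public static void main(String[] args) {
        String sample = "The new agency became operational on October 1, 1958.[6][7] "
                + "Since that time, most U.S. space exploration efforts have been led by NASA.";
        // Original: the second sentence starts with "[6][7]".
        split(sample, ORIGINAL).forEach(System.out::println);
        // Tweaked: the first sentence ends with "[6][7]".
        split(sample, TWEAKED).forEach(System.out::println);
    }
}
```

Both patterns find two sentences in this sample; only the sentence that the citation markers are attached to changes.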

Lucas Trzesniewski answered Nov 14 '22 22:11