
Regular expression match a sentence [closed]

Tags: java, regex

How can I match a sentence of the form "Hello world" or "Hello World"? The sentence may also contain hyphens, slashes, and the digits 0-9. Any information would be very helpful. Thank you.

Tapas Bose asked Apr 05 '11 14:04



1 Answer

This one does a pretty good job. My definition of a sentence: it begins with a character that is neither whitespace nor one of the punctuation marks . ! ? and it ends with a period, exclamation point, or question mark (or at the end of the string). A closing quote may follow the ending punctuation.

[^.!?\s][^.!?]*(?:[.!?](?!['"]?\s|$)[^.!?]*)*[.!?]?['"]?(?=\s|$)

import java.util.regex.*;
public class TEST {
    public static void main(String[] args) {
        String subjectString =
            "This is a sentence. " +
            "So is \"this\"! And is \"this?\" " +
            "This is 'stackoverflow.com!' " +
            "Hello World";
        Pattern re = Pattern.compile(
            "# Match a sentence ending in punctuation or EOS.\n" +
            "[^.!?\\s]    # First char is non-punct, non-ws\n" +
            "[^.!?]*      # Greedily consume up to punctuation.\n" +
            "(?:          # Group for unrolling the loop.\n" +
            "  [.!?]      # (special) inner punctuation ok if\n" +
            "  (?!['\"]?\\s|$)  # not followed by ws or EOS.\n" +
            "  [^.!?]*    # Greedily consume up to punctuation.\n" +
            ")*           # Zero or more (special normal*)\n" +
            "[.!?]?       # Optional ending punctuation.\n" +
            "['\"]?       # Optional closing quote.\n" +
            "(?=\\s|$)", 
            Pattern.MULTILINE | Pattern.COMMENTS);
        Matcher reMatcher = re.matcher(subjectString);
        while (reMatcher.find()) {
            System.out.println(reMatcher.group());
        } 
    }
}

Here is the output:

This is a sentence.
So is "this"!
And is "this?"
This is 'stackoverflow.com!'
Hello World

Matching all of these correctly (with the last sentence having no ending punctuation) turns out to be not as easy as it seems!
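
The question also mentions hyphens, slashes, and digits. None of those appear in the punctuation class [.!?], so the pattern should treat them as ordinary sentence characters. Here is a minimal sketch to check that, assuming a hypothetical helper class SentenceSplitter (my name for illustration, not part of the code above) that wraps the same pattern on a single line and collects the matches into a list:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; the name SentenceSplitter is an assumption.
class SentenceSplitter {
    // Same pattern as the commented version, written without Pattern.COMMENTS.
    private static final Pattern SENTENCE = Pattern.compile(
        "[^.!?\\s][^.!?]*(?:[.!?](?!['\"]?\\s|$)[^.!?]*)*[.!?]?['\"]?(?=\\s|$)",
        Pattern.MULTILINE);

    // Returns every sentence found in the text, in order of appearance.
    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        Matcher m = SENTENCE.matcher(text);
        while (m.find()) {
            sentences.add(m.group());
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Hyphens, slashes and digits are plain characters to this pattern.
        for (String s : split("Call 555-0100 on 05/04/2011. Hello World")) {
            System.out.println(s);
        }
    }
}
```

With that input the sketch should print "Call 555-0100 on 05/04/2011." and "Hello World" on separate lines. If a sentence must contain only letters, digits, spaces, "-" and "/", the negated classes would need to be replaced with an explicit whitelist, which this sketch does not attempt.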

ridgerunner answered Oct 14 '22 09:10