
Constructing regex pattern to match sentence

Tags: java, regex

I'm trying to write a regex pattern that matches any sentence beginning with one or more tabs and/or spaces. For example, I want the pattern to match " hello there I like regex!", but I'm stuck on how to match the words after "hello". So far I have this:

    String REGEX = "(?s)(\\p{Blank}+)([a-z][ ])*";
    Pattern PATTERN = Pattern.compile(REGEX);
    Matcher m = PATTERN.matcher("         asdsada  adf adfah.");
    if (m.matches()) {
        System.out.println("hurray!");
    }

Any help would be appreciated. Thanks.

user1923 asked Dec 02 '13 04:12




2 Answers

An example regex that matches sentences under the definition "a sentence is a series of characters that starts with at least one whitespace character and ends in one of ., ! or ?" is as follows:

\s+[^.!?]*[.!?]


Note that newline characters can also be included in a match, because both \s and the negated class [^.!?] match them.
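For reference, here is a minimal Java sketch of how the pattern above might be used. The class and method names (LeadingWhitespaceSentence, isSentence, findSentences) are made up for illustration and are not part of the question. It contrasts matches(), which requires the whole input to be one sentence, with find(), which pulls out each sentence in a longer string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named here only for illustration.
public class LeadingWhitespaceSentence {
    // The pattern from this answer: leading whitespace, a body with no
    // terminator characters, then exactly one terminator.
    static final Pattern SENTENCE = Pattern.compile("\\s+[^.!?]*[.!?]");

    // True only when the entire input is a single such sentence.
    static boolean isSentence(String input) {
        return SENTENCE.matcher(input).matches();
    }

    // Collects every sentence found anywhere in the input.
    static List<String> findSentences(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = SENTENCE.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(isSentence("         asdsada  adf adfah.")); // true
        System.out.println(isSentence("no leading whitespace."));       // false
        // Each match keeps its leading whitespace, including the newline.
        for (String s : findSentences(" One.\n Two! Three?")) {
            System.out.println("[" + s + "]");
        }
    }
}
```

If the leading whitespace should not be part of the result, one option is to wrap the rest of the pattern in a capturing group and read that group instead of the whole match.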

Taylor Hx answered Sep 30 '22 09:09


A sentence starts with a word boundary (hence \b) and ends with one or more terminators. Thus:

\b[^.!?]+[.!?]+

https://regex101.com/r/7DdyM1/1

This gives fairly accurate results. However, it does not handle fractional numbers, because the decimal point is treated as a sentence terminator. For example, this sentence is interpreted as two sentences:

The value of PI is 3.141...
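
The split can be reproduced with a short Java sketch. The class and method names (WordBoundarySentences, split) are hypothetical, chosen only for this example. It runs the pattern above with find() and prints each match in brackets.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, named here only for illustration.
public class WordBoundarySentences {
    // The pattern from this answer: word boundary, a body with no
    // terminator characters, then one or more terminators.
    static final Pattern SENTENCE = Pattern.compile("\\b[^.!?]+[.!?]+");

    // Returns every match of the pattern in the given text, in order.
    static List<String> split(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = SENTENCE.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints [The value of PI is 3.] and then [141...]
        for (String s : split("The value of PI is 3.141...")) {
            System.out.println("[" + s + "]");
        }
        // Prints [Hello there.] and then [How are you?]
        for (String s : split("Hello there. How are you?")) {
            System.out.println("[" + s + "]");
        }
    }
}
```

The first match stops at the decimal point, and the digits after it begin a new match because a word boundary sits between "." and "1".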
l33t answered Sep 30 '22 09:09