Regular expression for counting words in a sentence

Question

public static int getWordCount(String sentence) {
    return sentence.split("(([a-zA-Z0-9]([-][_])*[a-zA-Z0-9])+)", -1).length
         + sentence.replaceAll("([[a-z][A-Z][0-9][\W][-][_]]*)", "").length() - 1;
}

My intention is to count the number of words in a sentence. The input to this function is a lengthy sentence; it may have 255 words.

A word may contain hyphens or underscores in between.
The function should count only valid words: runs of special characters, e.g. &&&& or ####, should not be counted as words.

The above regular expression works fine, except when a hyphen or underscore occurs inside a word, e.g. co-operation: the count is returned as 2, but it should be 1. Can anyone please help?

Willem Van Onsem · Accepted Answer

Instead of using .split and .replaceAll, which are quite expensive operations, use an approach whose extra memory usage is constant.

Based on your specifications, you seem to be looking for the following regex, where \w matches letters, digits, and the underscore:

[\w-]+

Next you can use this approach to count the number of matches:

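// Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
public static int getWordCount(String sentence) {
    // In a Java string literal the backslash must be doubled: "\\w" is the regex \w.
    Pattern pattern = Pattern.compile("[\\w-]+");
    Matcher matcher = pattern.matcher(sentence);
    int count = 0;
    // Each successful find() advances to the next word-like run.
    while (matcher.find()) {
        count++;
    }
    return count;
}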

online jDoodle demo.

This approach uses (nearly) constant extra memory. When splitting, the program constructs an array of substrings, which is wasted work, since you never inspect the contents of the array.
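As a minimal sketch of the counting approach above, the following self-contained class wraps the same loop and runs it on the two cases from the question. The class name WordCounter and the sample sentences are made up for illustration, and hoisting the compiled Pattern into a constant is an extra assumption on my part (it avoids recompiling the regex on every call), not something the question requires.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordCounter {
    // Compiled once and reused; Pattern instances are immutable and thread-safe.
    private static final Pattern WORD = Pattern.compile("[\\w-]+");

    static int getWordCount(String sentence) {
        Matcher matcher = WORD.matcher(sentence);
        int count = 0;
        // No array of substrings is built; only the counter is kept.
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // A hyphenated word is counted once.
        System.out.println(getWordCount("co-operation is key"));  // prints 3
        // Runs of special characters are not counted.
        System.out.println(getWordCount("&&&& #### word_one"));   // prints 1
    }
}
```

Note that with this regex a lone hyphen surrounded by spaces still counts as a word, because the character class accepts a hyphen on its own.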

If you don't want words to start or end with hyphens, you can use the following regex:

\w+([-]\w+)*
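To illustrate the difference between the two regexes, here is a small sketch that counts matches for both on the same input. The class name HyphenRuleDemo, the helper count, and the sample sentence are hypothetical; the stricter pattern is written as \w+(-\w+)*, which is equivalent to the one above without the redundant brackets.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HyphenRuleDemo {
    // Loose: any run of word characters and hyphens.
    static final Pattern LOOSE = Pattern.compile("[\\w-]+");
    // Strict: word characters, optionally joined by single hyphens.
    static final Pattern STRICT = Pattern.compile("\\w+(-\\w+)*");

    static int count(Pattern pattern, String sentence) {
        Matcher matcher = pattern.matcher(sentence);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String sentence = "well-known - and -- user_name";
        // Loose also counts the stand-alone "-" and "--".
        System.out.println(count(LOOSE, sentence));   // prints 5
        // Strict counts only well-known, and, user_name.
        System.out.println(count(STRICT, sentence));  // prints 3
    }
}
```

One caveat: the stricter regex does not reject a token such as -start; it simply leaves the leading hyphen out of the match, so the token still counts as one word. A doubled hyphen inside a token, as in a--b, splits it into two matches.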

Tags:

java

regex

neena

