
Regular expression removing all words shorter than n

Tags:

java

regex

Well, I'm looking for a regexp in Java that deletes all words shorter than 3 characters. I thought something like \s\w{1,2}\s would grab all the 1 and 2 letter words (a whitespace, one to two word characters and another whitespace), but it just doesn't work. Where am I wrong?

janesconference asked Sep 26 '09 00:09


3 Answers

I've got it working fairly well, but it took two passes.

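public class RemoveShortWords {  // class name is arbitrary
    public static void main(String[] args) {
        String passage = "Well, I'm looking for a regexp in Java that deletes all words shorter than 3 characters.";
        System.out.println(passage);

        // Pass 1: delete every run of one or two word characters or apostrophes
        // that sits between word boundaries.
        passage = passage.replaceAll("\\b[\\w']{1,2}\\b", "");
        // Pass 2: collapse each run of two or more whitespace characters into one space.
        passage = passage.replaceAll("\\s{2,}", " ");

        System.out.println(passage);
    }
}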

The first pass deletes every word of fewer than three characters by replacing it with an empty string. Note that I had to include the apostrophe in the character class, because the word "I'm" was giving me trouble without it. You may find other special characters in your text that you also need to include here.

The second pass is necessary because deleting a word leaves the spaces on either side of it behind, so the first pass produces double spaces. This pass collapses every run of two or more whitespace characters down to a single space. It's up to you whether you need to keep this or not, but I think it's better with the spaces collapsed.

Output:

Well, I'm looking for a regexp in Java that deletes all words shorter than 3 characters.

Well, looking for regexp Java that deletes all words shorter than characters.
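One caveat about the apostrophe in the character class: it fixes "I'm", but it can also clip the end of a longer contraction. Here is a minimal sketch of that behavior. The class name ApostropheDemo, the helper strip and the test phrase "don't stop" are my own, not part of the answer above.

```java
class ApostropheDemo {
    // Pattern from the answer: apostrophes count as word characters.
    static final String WITH_APOSTROPHE = "\\b[\\w']{1,2}\\b";
    // Plain variant without the apostrophe.
    static final String PLAIN = "\\b\\w{1,2}\\b";

    // Hypothetical helper: deletes every match of regex from text.
    static String strip(String regex, String text) {
        return text.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        // Without the apostrophe, "I" and "m" are deleted and the apostrophe is left behind.
        System.out.println(strip(PLAIN, "I'm looking"));            // ' looking
        // With the apostrophe, "I'" and "m" are both deleted.
        System.out.println(strip(WITH_APOSTROPHE, "I'm looking"));  //  looking
        // The same class also matches the "'t" at the end of "don't".
        System.out.println(strip(WITH_APOSTROPHE, "don't stop"));   // don stop
        System.out.println(strip(PLAIN, "don't stop"));             // don' stop
    }
}
```

So whether to include the apostrophe depends on which contractions show up in your text; neither variant handles both cases cleanly.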

Bill the Lizard answered Nov 13 '22 22:11


If you don't want the whitespace matched, you might want to use

\b\w{1,2}\b

to get the word boundaries.

That's working for me in RegexBuddy using the Java flavor; for the test string

"The dog is fun a cat"

it highlights "is" and "a". Similarly for words at the beginning/end of a line.
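For comparison, here is a rough sketch of how the two patterns behave in Java itself. It probably also explains why the \s...\s version seems not to work: each match consumes the whitespace on both sides, so a short word at the start of the string, or one directly after another match, has no leading whitespace left to match. The class name BoundaryDemo, the helper matches and the second test string are my own assumptions. Note the doubled backslashes in the Java string literals.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryDemo {
    // Hypothetical helper: collects every substring of text matched by regex, in order.
    static List<String> matches(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Word boundaries: finds "is" and "a".
        System.out.println(matches("\\b\\w{1,2}\\b", "The dog is fun a cat"));

        String text = "A dog is a pet";
        // Word boundaries: finds "A", "is" and "a".
        System.out.println(matches("\\b\\w{1,2}\\b", text));
        // Whitespace on both sides: finds only " is ". "A" has no leading
        // whitespace, and the space before "a" was consumed by the previous match.
        System.out.println(matches("\\s\\w{1,2}\\s", text));
    }
}
```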

You might want to post a code sample.

(And, as GameFreak just posted, you'll still end up with double spaces.)

EDIT:

\b\w{1,2}\b\s?

is another option. This will partially fix the space-stripping issue, although words at the end of a string or followed by punctuation can still cause issues. For example, "A dog is fun no?" becomes "dog fun ?". In any case, you're still going to have issues with capitalization (dog should now be Dog).
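As a quick sketch of that last pattern in Java (the class name TrailingSpaceDemo and the helper strip are just illustrative names):

```java
class TrailingSpaceDemo {
    // Deletes each 1- or 2-character word together with one following
    // whitespace character, if there is one.
    static String strip(String text) {
        return text.replaceAll("\\b\\w{1,2}\\b\\s?", "");
    }

    public static void main(String[] args) {
        System.out.println(strip("The dog is fun a cat"));  // The dog fun cat
        // "no" is followed by "?", so the space before it is left in place.
        System.out.println(strip("A dog is fun no?"));      // dog fun ?
    }
}
```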

TrueWill answered Nov 13 '22 21:11


Try \b\w{1,2}\b, although you will still have to get rid of the double spaces that show up.
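Since the title asks about words shorter than n, here is one possible way to generalize this, sketched under a few assumptions: the class ShortWordRemover and the method removeShortWords are hypothetical names, a "word" means a run of \w characters (so apostrophes are not handled), and leftover whitespace is collapsed and trimmed afterwards.

```java
class ShortWordRemover {
    // Hypothetical helper: deletes every word shorter than minLength characters,
    // then collapses whitespace runs and trims the ends.
    static String removeShortWords(String text, int minLength) {
        if (minLength < 2) {
            return text;  // no word can be shorter than one character
        }
        // For minLength = 3 this builds \b\w{1,2}\b.
        String shortWord = "\\b\\w{1," + (minLength - 1) + "}\\b";
        return text.replaceAll(shortWord, "")
                   .replaceAll("\\s{2,}", " ")
                   .trim();
    }

    public static void main(String[] args) {
        System.out.println(removeShortWords("The dog is fun a cat", 3));  // The dog fun cat
        System.out.println(removeShortWords("A dog is fun no?", 3));      // dog fun ?
    }
}
```

Punctuation next to a deleted word is still left behind, as the second call shows.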

GameFreak answered Nov 13 '22 22:11