Regular Expressions on Punctuation

Tags: java, string, regex

So I'm completely new to regular expressions, and I'm trying to use Java's java.util.regex to find punctuation in input strings. I won't know what kind of punctuation I might get ahead of time, except that (1) !, ?, ., ... are all valid punctuation, and (2) "<" and ">" mean something special, and don't count as punctuation. The program itself builds phrases pseudo-randomly, and I want to strip off the punctuation at the end of a sentence before it goes through the random process.

I can match entire words with any punctuation, but the matcher just gives me indexes for that word. In other words:

    Pattern p = Pattern.compile("(.*\\!)*?");
    Matcher m = p.matcher([some input string]);

will grab any words with a "!" on the end. For example:

    String inputString = "It is a warm Summer day!";
    Pattern p = Pattern.compile("(.*\\!)*?");
    Matcher m = p.matcher(inputString);
    String match = inputString.substring(m.start(), m.end());

results in --> String match ~ "day!"

But I want to have Matcher index just the "!", so I can just split it off.

I could probably make cases, and use String.substring(...) for each kind of punctuation I might get, but I'm hoping there's some mistake in my use of regular expressions to do this.

Mister R2 asked Jul 28 '12 22:07

2 Answers

Java does support POSIX character classes in a roundabout way. For punctuation, the Java equivalent of [:punct:] is \p{Punct}.

Please see the following link for details.

Here is a concrete, working example that uses the expression in the comments:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class RegexFindPunctuation {

        public static void main(String[] args) {
            Pattern p = Pattern.compile("\\p{Punct}");

            Matcher m = p.matcher("One day! when I was walking. I found your pants? just kidding...");
            int count = 0;
            while (m.find()) {
                count++;
                System.out.println("\nMatch number: " + count);
                System.out.println("start() : " + m.start());
                System.out.println("end()   : " + m.end());
                System.out.println("group() : " + m.group());
            }
        }
    }
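If the goal is only to split off the punctuation at the end of a sentence, and "<" and ">" must not count, one possible approach is to intersect \p{Punct} with a negated class and anchor the match at the end of the input. The following is a minimal sketch under those assumptions; the class name TrailingPunctuation and its two helper methods are hypothetical, made up for illustration and not part of the original answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names here are illustrative only.
class TrailingPunctuation {

    // One or more punctuation characters at the end of the input.
    // The intersection (&&) with [^<>] excludes '<' and '>' from \p{Punct}.
    private static final Pattern TRAILING =
            Pattern.compile("[\\p{Punct}&&[^<>]]+$");

    // Returns the index where the trailing punctuation starts,
    // or -1 if the input does not end with punctuation.
    static int trailingPunctuationStart(String input) {
        Matcher m = TRAILING.matcher(input);
        // find() must be called before start() can be used.
        return m.find() ? m.start() : -1;
    }

    // Returns the input with any trailing punctuation removed.
    static String stripTrailingPunctuation(String input) {
        int start = trailingPunctuationStart(input);
        return start < 0 ? input : input.substring(0, start);
    }

    public static void main(String[] args) {
        System.out.println(stripTrailingPunctuation("It is a warm Summer day!"));
        System.out.println(stripTrailingPunctuation("just kidding..."));
        System.out.println(stripTrailingPunctuation("keep <this>"));
    }
}
```

Because the index comes from m.start(), the punctuation itself is available as input.substring(start) if it needs to be reattached later. Note that start() and end() are only valid after a successful find() or matches() call.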
EdgeCase answered Sep 30 '22 12:09


I would try a character class regex similar to

"[.!?\\-]" 

Add whatever characters you wish to match inside the []s. Be careful to escape any characters that might have a special meaning to the regex parser.

You then have to iterate through the matches by using Matcher.find() until it returns false.
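The iteration described above could look roughly like the sketch below. It collects the index of every character matched by the class "[.!?\\-]". The class name PunctuationFinder and the method punctuationIndexes are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names here are illustrative only.
class PunctuationFinder {

    // Character class matching '.', '!', '?' and '-'.
    private static final Pattern PUNCTUATION = Pattern.compile("[.!?\\-]");

    // Returns the index of every punctuation character in the input.
    static List<Integer> punctuationIndexes(String input) {
        List<Integer> indexes = new ArrayList<>();
        Matcher m = PUNCTUATION.matcher(input);
        // find() advances to the next match and returns false when none remain.
        while (m.find()) {
            indexes.add(m.start());
        }
        return indexes;
    }

    public static void main(String[] args) {
        System.out.println(punctuationIndexes("Wait - really? Yes!"));
    }
}
```

Since each match is a single character, m.start() alone identifies it; m.end() is always m.start() + 1 for this pattern.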

Code-Apprentice answered Sep 30 '22 11:09