
Java Regex is including new line in match

Tags: java, regex

I'm trying to match a regular expression against textbook definitions that I get from a website. Each entry always consists of the word, then a newline, then the definition. For example:

Zither
 Definition: An instrument of music used in Austria and Germany It has from thirty to forty wires strung across a shallow sounding board which lies horizontally on a table before the performer who uses both hands in playing on it Not to be confounded with the old lute shaped cittern or cithern

In my attempts to get just the word (in this case "Zither") I keep getting the newline character.

I tried both ^(\w+)\s and ^(\S+)\s without much luck. I thought that maybe ^(\S+)$ would work, but that doesn't match the word at all. I've been testing with Rubular (http://rubular.com/r/LPEHCnS0ri), which matches all my attempts the way I want, even though Java doesn't.

Here's my snippet:

String str = ...; //Here the string is assigned a word and definition taken from the internet like given in the example above.
Pattern rgx = Pattern.compile("^(\\S+)$");
Matcher mtch = rgx.matcher(str);
if (mtch.find()) {
    String result = mtch.group();
    terms.add(new SearchTerm(result, System.nanoTime()));
}

This is easily solved by trimming the resulting string, but that seems like it should be unnecessary if I'm already using a regular expression.

All help is greatly appreciated. Thanks in advance!

Paul Nelson Baker asked Aug 15 '13 20:08


3 Answers

Try using the Pattern.MULTILINE option:

Pattern rgx = Pattern.compile("^(\\S+)$", Pattern.MULTILINE);

This makes ^ and $ match at the start and end of every line in your string. Without it, ^ matches only at the start of the whole input, and $ matches only at the end of the whole input (or just before a final line terminator).
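To see the difference, here is a minimal sketch that runs the same pattern with and without the flag. The class name MultilineDemo, the helper firstWholeLineWord and the shortened sample entry are made up for illustration; only Pattern, Matcher and Pattern.MULTILINE come from the JDK.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineDemo {
    // Hypothetical helper: returns capture group 1 of the first match of
    // ^(\S+)$ in text, or null when the pattern does not match at all.
    static String firstWholeLineWord(String text, int flags) {
        Matcher m = Pattern.compile("^(\\S+)$", flags).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Shortened version of the entry from the question.
        String entry = "Zither\n Definition: An instrument of music";

        // No flags: $ cannot match right after "Zither", so nothing is found.
        System.out.println(firstWholeLineWord(entry, 0));                 // prints null

        // MULTILINE: $ matches just before the newline.
        System.out.println(firstWholeLineWord(entry, Pattern.MULTILINE)); // prints Zither
    }
}
```

The match never includes the newline here, because $ is a zero-width assertion and \S+ cannot consume whitespace.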

Although it makes no difference for this pattern, the Matcher.group() method returns the entire match, whereas the Matcher.group(int) method returns the match of the particular capture group (...) identified by the number you specify. Your pattern has one capture group, and it contains exactly what you want captured. If you had included \s in your pattern, as in your earlier attempts, Matcher.group() would have included that whitespace in its return value.
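As a rough illustration of that last point, the sketch below uses the ^(\w+)\s attempt from the question and compares group() with group(1). GroupDemo and wholeAndCaptured are hypothetical names, and the sample entry is shortened.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Hypothetical helper: returns { whole match, capture group 1 } for the
    // pattern ^(\w+)\s, or null if the pattern does not match.
    static String[] wholeAndCaptured(String text) {
        Matcher m = Pattern.compile("^(\\w+)\\s").matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(), m.group(1) };
    }

    public static void main(String[] args) {
        String[] parts = wholeAndCaptured("Zither\n Definition: An instrument of music");

        // group() is the whole match, so it ends with the newline that \s consumed.
        System.out.println(parts[0].equals("Zither\n")); // prints true

        // group(1) is only what the parentheses captured.
        System.out.println(parts[1].equals("Zither"));   // prints true
    }
}
```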

Adrian Pronk answered Oct 23 '22 01:10


With regular expressions, group 0 is always the complete match; the first capture group is group 1. In your case you want group 1, not group 0.

So changing mtch.group() to mtch.group(1) should do the trick:

 String str = ...; //Here the string is assigned a word and definition taken from the internet like given in the example above.
 Pattern rgx = Pattern.compile("^(\\w+)\\s");
 Matcher mtch = rgx.matcher(str);
 if (mtch.find()) {
     String result = mtch.group(1);
     terms.add(new SearchTerm(result, System.nanoTime()));
 }

Mike Dinescu answered Oct 23 '22 02:10


A late response, but if you are not using Pattern and Matcher directly (for example, you are calling String.matches or String.replaceAll), you can switch on DOTALL with an inline flag at the start of your regex string:

(?s)[Your Expression]

Basically, (?s) tells the dot to match every character, including line terminators. The inline equivalent of Pattern.MULTILINE is (?m).
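A small sketch of the inline flag with String.replaceFirst, assuming an entry whose definition spans more than one line. InlineFlagDemo, headword and the sample text are invented for the example.

```java
class InlineFlagDemo {
    // Hypothetical helper: replaces the first match of regex with its
    // capture group 1, using String.replaceFirst instead of Pattern/Matcher.
    static String headword(String entry, String regex) {
        return entry.replaceFirst(regex, "$1");
    }

    public static void main(String[] args) {
        // Assumed sample: the definition continues on a second line.
        String entry = "Zither\n Definition: An instrument of music\n used in Austria and Germany";

        // Without (?s) the dot stops at the next newline, so the tail survives.
        String partial = headword(entry, "^(\\S+)\\s.*");
        System.out.println(partial.equals("Zither\n used in Austria and Germany")); // prints true

        // With (?s) the dot also matches newlines, so only the word is left.
        System.out.println(headword(entry, "(?s)^(\\S+)\\s.*")); // prints Zither
    }
}
```

Note that (?s) only changes what the dot matches; it does not change where ^ and $ match.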

Detailed information: http://www.vogella.com/tutorials/JavaRegularExpressions/article.html

Varun Garg answered Oct 23 '22 01:10