
Regex in Java: match groups until first symbol occurrence

Tags: java, regex

My string looks like this:

"Chitkara DK, Rawat DJY, Talley N. The epidemiology of childhood recurrent abdominal pain in Western countries: a systematic review. Am J Gastroenterol. 2005;100(8):1868-75. DOI."

What I want is to match the uppercase letters (as separate words only) up to the first dot, i.e. DK, DJY and N, but nothing after that dot, such as J or DOI.

Here's the pattern I pass to Java's Pattern class:

\\b[A-Z]{1,3}\\b

Is there a general option in regex to stop matching after a certain character?
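To make the problem concrete, here is a minimal sketch that runs the pattern over the whole string. The class and method names are assumptions made for illustration only. It shows that the pattern also picks up J and DOI, which appear after the first dot.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OverMatchDemo {
  // The pattern from the question: 1 to 3 uppercase letters standing alone as a word.
  private static final Pattern UPPER = Pattern.compile("\\b[A-Z]{1,3}\\b");

  // Hypothetical helper: collects every match in the input, with no stop condition.
  public static List<String> allMatches(String input) {
    List<String> results = new ArrayList<>();
    Matcher m = UPPER.matcher(input);
    while (m.find()) {
      results.add(m.group());
    }
    return results;
  }

  public static void main(String[] args) {
    String citation = "Chitkara DK, Rawat DJY, Talley N. The epidemiology of childhood "
        + "recurrent abdominal pain in Western countries: a systematic review. "
        + "Am J Gastroenterol. 2005;100(8):1868-75. DOI.";
    System.out.println(allMatches(citation)); // prints [DK, DJY, N, J, DOI]
  }
}
```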

Vitaliy asked Feb 25 '17 20:02



2 Answers

You can use continuous matching with \G, which anchors each match at the end of the previous one, and extract your desired matches from the first capturing group:

(?:\\G|^)[^.]+?\\b([A-Z]{1,3})\\b

To use this on multi-line input, compile the pattern with the MULTILINE flag so that ^ matches at the start of every line. If your input is always a single line, you may drop the |^ from the pattern.

See https://regex101.com/r/JXIu21/3

Note that regex101 uses a PCRE pattern, but all features used are also available in Java regex.
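As a rough sketch of how this pattern could be driven from Java (the class and method names here are assumptions, not part of the original answer): each call to find() resumes where the previous match ended, and group(1) holds the abbreviation. Once the next character is the first dot, [^.]+? cannot match and the loop ends.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContinuousMatchDemo {
  // \G matches at the start of input for the first find() and afterwards only at the
  // end of the previous match; [^.]+? lazily skips non-dot text up to the next candidate.
  private static final Pattern CHAINED =
      Pattern.compile("(?:\\G|^)[^.]+?\\b([A-Z]{1,3})\\b");

  // Hypothetical helper: collects group 1 of every chained match.
  public static List<String> extract(String input) {
    List<String> results = new ArrayList<>();
    Matcher m = CHAINED.matcher(input);
    while (m.find()) {
      results.add(m.group(1));
    }
    return results;
  }

  public static void main(String[] args) {
    String citation = "Chitkara DK, Rawat DJY, Talley N. The epidemiology of childhood "
        + "recurrent abdominal pain in Western countries: a systematic review. "
        + "Am J Gastroenterol. 2005;100(8):1868-75. DOI.";
    System.out.println(extract(citation)); // prints [DK, DJY, N]
  }
}
```

One caveat of this approach: because [^.]+? must consume at least one character, an abbreviation at the very start of the input (for example "DK Chitkara.") would not be captured. That does not affect strings shaped like the one in the question, where a surname comes first.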

Sebastian Proske answered Sep 24 '22 13:09


Sebastian Proske's answer is great, but it's often easier (and more readable) to split a complex parsing task into separate steps. Here that means two steps: first isolate the text before the first dot, then apply your original pattern to it. The result is a much simpler and more clearly correct solution.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

private static final Pattern UPPER_CASE_ABBV_PATTERN = Pattern.compile("\\b[A-Z]{1,3}\\b");

public static List<String> getAbbreviationsInFirstSentence(String input) {
  // isolate the first sentence, since that's all we care about;
  // the limit of 2 keeps the array non-empty even for input like "."
  String firstSentence = input.split("\\.", 2)[0];
  // then look for matches in the first sentence
  Matcher m = UPPER_CASE_ABBV_PATTERN.matcher(firstSentence);
  List<String> results = new ArrayList<>();
  while (m.find()) {
    results.add(m.group());
  }
  return results;
}
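A variant of the same two-step idea, offered as a sketch rather than as part of the answer above: instead of splitting the string, the matcher can be bounded with Matcher.region so that it never looks past the first dot. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FirstSentenceRegion {
  private static final Pattern UPPER_CASE_ABBV_PATTERN = Pattern.compile("\\b[A-Z]{1,3}\\b");

  // Hypothetical helper: restricts matching to the text before the first dot.
  public static List<String> abbreviationsBeforeFirstDot(String input) {
    int dot = input.indexOf('.');
    // If there is no dot at all, search the whole input.
    int end = (dot >= 0) ? dot : input.length();
    Matcher m = UPPER_CASE_ABBV_PATTERN.matcher(input).region(0, end);
    List<String> results = new ArrayList<>();
    while (m.find()) {
      results.add(m.group());
    }
    return results;
  }

  public static void main(String[] args) {
    String citation = "Chitkara DK, Rawat DJY, Talley N. The epidemiology of childhood "
        + "recurrent abdominal pain in Western countries: a systematic review. "
        + "Am J Gastroenterol. 2005;100(8):1868-75. DOI.";
    System.out.println(abbreviationsBeforeFirstDot(citation)); // prints [DK, DJY, N]
  }
}
```

This avoids creating the intermediate array of substrings, at the cost of a little index bookkeeping.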

dimo414 answered Sep 24 '22 13:09