How can I capture a multiline pattern using a regular expression in Java?

Tags:

java

regex

I have a text file that I need to parse using regular expressions. The text that I need to capture is in multiline groups like this:

truck
zDoug
Doug's house
(123) 456-7890
[email protected]
30
61234.56
8/10/2003

vehicle
eRob
Rob's house
(987) 654-3210
[email protected]

For this example I need to capture truck followed by the next seven lines. In other words, this "block" has eight groups. This is what I've tried, but it does not capture the next line:

(truck)\n(\w).

NOTE: I'm using the program RegExr to test my regex before I port it to Java.

asked Mar 03 '11 03:03 by lampShade



1 Answer

(?m)^truck(?:(?:\r\n|[\r\n]).+$)*

This assumes the whole text has been read into a single string (i.e., you're not reading the file line by line), but it doesn't assume the line separator is always \n, as your regex does. At a minimum you should allow for \r\n and \r as well, which is what (?:\r\n|[\r\n]) does. Each repetition still matches only one separator, and it must be followed by at least one character (.+), so the match stops before the double line separator at the end of the block.
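To see the separator handling in isolation, here is a minimal sketch that runs the same pattern against shortened sample data joined with each of the three separators. The class name SeparatorDemo and the helper firstBlock are made up for illustration, and the pattern is written with regex escapes (\\r, \\n) rather than literal control characters, which should behave the same way.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SeparatorDemo {
    // Same regex as above; \\r and \\n are passed to the regex engine as escapes.
    static final Pattern BLOCK =
        Pattern.compile("(?m)^truck(?:(?:\\r\\n|[\\r\\n]).+$)*");

    // Hypothetical helper: returns the first matched block, or null if none is found.
    static String firstBlock(String data) {
        Matcher m = BLOCK.matcher(data);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        for (String sep : new String[] {"\n", "\r\n", "\r"}) {
            // Two blocks separated by a doubled separator (an empty line).
            String data = "truck" + sep + "zDoug" + sep + "30" + sep + sep
                        + "vehicle" + sep + "eRob";
            String block = firstBlock(data);
            // Count the lines in the matched block.
            System.out.println(block.split("\\r\\n|[\\r\\n]").length + " lines");
        }
    }
}
```

Each of the three runs should report 3 lines, because the match ends before the empty line regardless of which separator is used.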

Once you've matched a block of data, you can split it on the line separators to get the individual lines. Here's an example:

Pattern p0 = Pattern.compile("(?m)^truck(?:(?:\r\n|[\r\n]).+$)*");
Matcher m = p0.matcher(data);
while (m.find())
{
  String fullMatch = m.group();
  int n = 0;
  for (String s : fullMatch.split("\r\n|[\r\n]"))
  {
    System.out.printf("line %d: %s%n", n++, s);
  }
}

output:

line 0: truck
line 1: zDoug
line 2: Doug's house
line 3: (123) 456-7890
line 4: [email protected]
line 5: 30
line 6: 61234.56
line 7: 8/10/2003
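If you would rather have each line in its own capturing group, as the question describes, one possible variation is to spell out the seven following lines explicitly instead of splitting afterwards. This is only a sketch: the class name TruckGroups and its methods are invented for the example, the sample email address is a placeholder, and it assumes every truck block has exactly seven non-empty lines after the first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TruckGroups {
    // One line separator: \r\n, \r, or \n.
    static final String SEP = "(?:\\r\\n|[\\r\\n])";

    // Builds "^(truck)" followed by seven "separator + (non-empty line)" groups.
    static Pattern buildPattern() {
        StringBuilder sb = new StringBuilder("(?m)^(truck)");
        for (int i = 0; i < 7; i++) {
            sb.append(SEP).append("(.+)");
        }
        return Pattern.compile(sb.toString());
    }

    // Returns the eight captured lines of the first truck block, or an empty list.
    static List<String> firstRecord(String data) {
        Matcher m = buildPattern().matcher(data);
        List<String> fields = new ArrayList<>();
        if (m.find()) {
            for (int g = 1; g <= m.groupCount(); g++) {
                fields.add(m.group(g));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Placeholder data modeled on the question; the email is made up.
        String data = String.join("\n",
            "truck", "zDoug", "Doug's house", "(123) 456-7890",
            "doug@example.com", "30", "61234.56", "8/10/2003",
            "", "vehicle", "eRob");
        List<String> fields = firstRecord(data);
        for (int i = 0; i < fields.size(); i++) {
            System.out.printf("group %d: %s%n", i + 1, fields.get(i));
        }
    }
}
```

The trade-off is that this version fails to match a block with fewer than seven following lines, whereas the match-then-split approach accepts any number of lines.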

I'm also assuming each line of data contains at least one character, and that the blank lines between data blocks are really empty, i.e., they contain no spaces, tabs, or other invisible characters.
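If that second assumption might not hold, a possible adjustment is to require at least one non-whitespace character on every continuation line, so that a line holding only spaces or tabs also ends the block. The sketch below compares the two patterns on made-up data; BlankLineDemo, STRICT, TOLERANT and countLines are names chosen for the example, not anything from the original code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BlankLineDemo {
    // One line separator: \r\n, \r, or \n.
    static final String SEP = "(?:\\r\\n|[\\r\\n])";

    // Original pattern: any non-empty line continues the block.
    static final Pattern STRICT =
        Pattern.compile("(?m)^truck(?:" + SEP + ".+$)*");

    // Variation: a continuation line needs at least one non-whitespace character.
    static final Pattern TOLERANT =
        Pattern.compile("(?m)^truck(?:" + SEP + "[ \\t]*\\S.*$)*");

    // Number of lines in the first matched block, or 0 if nothing matches.
    static int countLines(Pattern p, String data) {
        Matcher m = p.matcher(data);
        return m.find() ? m.group().split("\\r\\n|[\\r\\n]").length : 0;
    }

    public static void main(String[] args) {
        // The "blank" line between the blocks contains a space and a tab.
        String data = "truck\nzDoug\n30\n \t\nvehicle\neRob";
        System.out.println("strict:   " + countLines(STRICT, data));
        System.out.println("tolerant: " + countLines(TOLERANT, data));
    }
}
```

With this data the original pattern runs straight through the whitespace-only line into the next block (6 lines), while the variation stops after the third line.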

(BTW: To test that regex in RegExr, remove the (?m) and check the multiline box instead. RegExr is powered by ActionScript, so the rules are a little different. For a Java-powered regex tester, check out RegexPlanet.)

answered Oct 28 '22 14:10 by Alan Moore