How do I create a Stream of regex matches?

I am trying to parse standard input, extract every string that matches a specific pattern, count the number of occurrences of each match, and print the results alphabetically. This problem seems like a good fit for the Streams API, but I can't find a concise way to create a stream of matches from a Matcher.

I worked around this problem by implementing an iterator over the matches and wrapping it into a Stream, but the result is not very readable. How can I create a stream of regex matches without introducing additional classes?

public class PatternCounter
{
    static private class MatcherIterator implements Iterator<String> {
        private final Matcher matcher;
        public MatcherIterator(Matcher matcher) {
            this.matcher = matcher;
        }
        public boolean hasNext() {
            return matcher.find();
        }
        public String next() {
            return matcher.group(0);
        }
    }

    static public void main(String[] args) throws Throwable {
        Pattern pattern = Pattern.compile("[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\\.[a-zA-Z0-9-]+)");

        new TreeMap<String, Long>(new BufferedReader(new InputStreamReader(System.in))
            .lines().map(line -> {
                Matcher matcher = pattern.matcher(line);
                return StreamSupport.stream(
                        Spliterators.spliteratorUnknownSize(new MatcherIterator(matcher), Spliterator.ORDERED), false);
            }).reduce(Stream.empty(), Stream::concat).collect(groupingBy(o -> o, counting()))
        ).forEach((k, v) -> {
            System.out.printf("%s\t%s\n",k,v);
        });
    }
}
Alfredo Diaz asked Jan 26 '15 10:01


3 Answers

In Java 8, there is Pattern.splitAsStream, which provides a stream of the items separated by a delimiter pattern, but unfortunately there is no supporting method for getting a stream of matches.
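
To illustrate why splitAsStream does not help here, the following minimal sketch splits a string around a delimiter pattern. The class name SplitAsStreamDemo and the helper splitOn are made up for this illustration; only Pattern.splitAsStream comes from the JDK.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class SplitAsStreamDemo {
    // Hypothetical helper: returns the pieces of the input that lie between delimiter matches.
    static List<String> splitOn(String delimiterRegex, String input) {
        return Pattern.compile(delimiterRegex)
                .splitAsStream(input)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The pattern describes the separators, not the tokens to keep.
        System.out.println(splitOn("\\s*,\\s*", "foo, bar ,baz")); // prints [foo, bar, baz]
    }
}
```

The stream contains the text between the matches, so it only works when the unwanted parts can be described by a pattern, which is the opposite of extracting the matches themselves.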

If you are going to implement such a Stream, I recommend implementing Spliterator directly rather than implementing and wrapping an Iterator. You may be more familiar with Iterator, but implementing a simple Spliterator is straightforward:

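import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.regex.Matcher;

final class MatchItr extends Spliterators.AbstractSpliterator<String> {
    private final Matcher matcher;
    MatchItr(Matcher m) {
        // The region length serves only as a rough size estimate; SIZED is not reported.
        super(m.regionEnd() - m.regionStart(), ORDERED | NONNULL);
        matcher = m;
    }
    @Override
    public boolean tryAdvance(Consumer<? super String> action) {
        // Move to the next match; signal exhaustion when there is none.
        if (!matcher.find()) return false;
        action.accept(matcher.group());
        return true;
    }
}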

You may consider overriding forEachRemaining with a simple loop, though.
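
One way that override could look is sketched below. This is an illustration rather than part of the original answer: the class name LoopingMatchItr and the helper allMatches are made up, and the rest repeats the spliterator above so the sketch compiles on its own.

```java
import java.util.List;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.StreamSupport;

final class LoopingMatchItr extends Spliterators.AbstractSpliterator<String> {
    private final Matcher matcher;

    LoopingMatchItr(Matcher m) {
        super(m.regionEnd() - m.regionStart(), ORDERED | NONNULL);
        matcher = m;
    }

    @Override
    public boolean tryAdvance(Consumer<? super String> action) {
        if (!matcher.find()) return false;
        action.accept(matcher.group());
        return true;
    }

    // Bulk traversal: a plain loop instead of repeated tryAdvance calls.
    @Override
    public void forEachRemaining(Consumer<? super String> action) {
        while (matcher.find()) {
            action.accept(matcher.group());
        }
    }

    // Hypothetical convenience method that collects all matches into a list.
    static List<String> allMatches(Pattern pattern, CharSequence input) {
        return StreamSupport.stream(new LoopingMatchItr(pattern.matcher(input)), false)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(allMatches(Pattern.compile("\\w+"), "foo bar baz")); // prints [foo, bar, baz]
    }
}
```

Terminal operations that consume the whole stream, such as collect, go through forEachRemaining, so the loop replaces the default implementation that calls tryAdvance repeatedly.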


If I understand your attempt correctly, the solution should look more like:

Pattern pattern = Pattern.compile(
                 "[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\\.[a-zA-Z0-9-]+)");

try(BufferedReader br=new BufferedReader(new InputStreamReader(System.in))) {

    br.lines()
      .flatMap(line -> StreamSupport.stream(new MatchItr(pattern.matcher(line)), false))
      .collect(Collectors.groupingBy(o->o, TreeMap::new, Collectors.counting()))
      .forEach((k, v) -> System.out.printf("%s\t%s\n",k,v));
}

Java 9 provides a method Stream<MatchResult> results() directly on the Matcher. But for finding matches within a stream, there’s an even more convenient method on Scanner. With that, the implementation simplifies to

try(Scanner s = new Scanner(System.in)) {
    s.findAll(pattern)
     .collect(Collectors.groupingBy(MatchResult::group,TreeMap::new,Collectors.counting()))
     .forEach((k, v) -> System.out.printf("%s\t%s\n",k,v));
}
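
The Matcher.results() method mentioned above also fits the line-by-line approach. The following is a minimal sketch, assuming Java 9 or later; the class name ResultsDemo, the helper countMatches, and the simplified address pattern are made up for this illustration.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class ResultsDemo {
    // Hypothetical helper: counts every match across all lines, sorted by match text.
    static Map<String, Long> countMatches(Pattern pattern, Stream<String> lines) {
        return lines
                .flatMap(line -> pattern.matcher(line).results()) // Stream<MatchResult> per line
                .collect(Collectors.groupingBy(MatchResult::group, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Simplified pattern, used only to keep the example short.
        Pattern simple = Pattern.compile("\\w+@\\w+\\.\\w+");
        Map<String, Long> counts = countMatches(simple, Stream.of("a@b.c x a@b.c", "d@e.f"));
        counts.forEach((k, v) -> System.out.printf("%s\t%s%n", k, v));
    }
}
```

Passing the lines as a Stream<String> keeps the helper independent of where the input comes from, so BufferedReader.lines() can be plugged in directly.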

This answer contains a back-port of Scanner.findAll that can be used with Java 8.

Holger answered Oct 11 '22 20:10


Building on Holger's solution, we can support arbitrary match operations (such as getting the nth group) by returning a Stream<MatchResult> that callers map as needed, with a convenience method for the common case of the whole match. We can also hide the Spliterator as an implementation detail, so that callers work with the Stream directly. As a rule of thumb, StreamSupport should be used by library code rather than by users.

public class MatcherStream {
  private MatcherStream() {}

  public static Stream<String> find(Pattern pattern, CharSequence input) {
    return findMatches(pattern, input).map(MatchResult::group);
  }

  public static Stream<MatchResult> findMatches(
      Pattern pattern, CharSequence input) {
    Matcher matcher = pattern.matcher(input);

    Spliterator<MatchResult> spliterator = new Spliterators.AbstractSpliterator<MatchResult>(
        Long.MAX_VALUE, Spliterator.ORDERED|Spliterator.NONNULL) {
      @Override
      public boolean tryAdvance(Consumer<? super MatchResult> action) {
        if(!matcher.find()) return false;
        action.accept(matcher.toMatchResult());
        return true;
      }
    };

    return StreamSupport.stream(spliterator, false);
  }
}

You can then use it like so:

MatcherStream.find(Pattern.compile("\\w+"), "foo bar baz").forEach(System.out::println);

Or for your specific task (borrowing again from Holger):

try(BufferedReader br = new BufferedReader(new InputStreamReader(System.in))) {
  br.lines()
    .flatMap(line -> MatcherStream.find(pattern, line))
    .collect(Collectors.groupingBy(o->o, TreeMap::new, Collectors.counting()))
    .forEach((k, v) -> System.out.printf("%s\t%s\n", k, v));
}
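
Getting the nth group from a stream of MatchResult objects could look like the sketch below. The class name GroupStreamDemo, the helper firstGroups, and the key=value pattern are made up for this illustration; findMatches repeats the approach shown above so the sketch compiles on its own.

```java
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

class GroupStreamDemo {
    // Same technique as MatcherStream.findMatches, repeated to keep this example self-contained.
    static Stream<MatchResult> findMatches(Pattern pattern, CharSequence input) {
        Matcher matcher = pattern.matcher(input);
        return StreamSupport.stream(new Spliterators.AbstractSpliterator<MatchResult>(
                Long.MAX_VALUE, Spliterator.ORDERED | Spliterator.NONNULL) {
            @Override
            public boolean tryAdvance(Consumer<? super MatchResult> action) {
                if (!matcher.find()) return false;
                // toMatchResult() takes a snapshot, so later find() calls do not alter it.
                action.accept(matcher.toMatchResult());
                return true;
            }
        }, false);
    }

    // Hypothetical helper: extracts capturing group 1 (the key) from every key=value pair.
    static List<String> firstGroups(String input) {
        Pattern keyValue = Pattern.compile("(\\w+)=(\\w+)");
        return findMatches(keyValue, input)
                .map(m -> m.group(1))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(firstGroups("host=example port=8080")); // prints [host, port]
    }
}
```

Because each element is an immutable snapshot, the group lookup can happen anywhere downstream in the pipeline.
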
dimo414 answered Oct 11 '22 18:10


If you want to use a Scanner, its findWithinHorizon method can also turn the matches of a regular expression into a stream of strings. The example uses a Stream.Builder, which is convenient to fill from a conventional while loop.

Here is an example:

private Stream<String> extractRulesFrom(String text, Pattern pattern, int group) {
    Stream.Builder<String> builder = Stream.builder();
    try(Scanner scanner = new Scanner(text)) {
        while (scanner.findWithinHorizon(pattern, 0) != null) {
            builder.accept(scanner.match().group(group));
        }
    }
    return builder.build();
} 
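
A self-contained usage sketch of the same idea follows. The class name ScannerGroupsDemo, the static method name extractGroup, and the sample text are made up for this illustration; the method body follows the approach above.

```java
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class ScannerGroupsDemo {
    // Collects the requested capturing group of every match into a Stream.Builder.
    static Stream<String> extractGroup(String text, Pattern pattern, int group) {
        Stream.Builder<String> builder = Stream.builder();
        try (Scanner scanner = new Scanner(text)) {
            // A horizon of 0 means the search is not bounded.
            while (scanner.findWithinHorizon(pattern, 0) != null) {
                builder.accept(scanner.match().group(group));
            }
        }
        return builder.build();
    }

    public static void main(String[] args) {
        List<String> names = extractGroup("rule:alpha rule:beta", Pattern.compile("rule:(\\w+)"), 1)
                .collect(Collectors.toList());
        System.out.println(names); // prints [alpha, beta]
    }
}
```

Note that the builder is filled eagerly: all matches are found before the stream is returned, unlike the Spliterator-based answers, which search lazily.
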
gil.fernandes answered Oct 11 '22 18:10