Collect HashSet / Java 8 / Regex Pattern / Stream API

I recently switched my project from JDK 7 to JDK 8, and now I am rewriting some code snippets using the new features that came with Java 8.

final Matcher mtr = Pattern.compile(regex).matcher(input);

HashSet<String> set = new HashSet<String>() {{
    while (mtr.find()) add(mtr.group().toLowerCase());
}};

How can I write this code using the Stream API?

Anton Dozortsev asked Jul 09 '14 18:07


3 Answers

A Matcher-based spliterator implementation can be quite simple if you reuse the JDK-provided Spliterators.AbstractSpliterator:

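import java.util.Spliterators.AbstractSpliterator;
import java.util.function.Consumer;
import java.util.regex.Matcher;

public class MatcherSpliterator extends AbstractSpliterator<String[]>
{
  private final Matcher m;

  public MatcherSpliterator(Matcher m) {
    // The number of matches is unknown up front, so report Long.MAX_VALUE.
    super(Long.MAX_VALUE, ORDERED | NONNULL | IMMUTABLE);
    this.m = m;
  }

  // Emits one String[] per match: index 0 is the full match, 1..n are the capture groups.
  @Override public boolean tryAdvance(Consumer<? super String[]> action) {
    if (!m.find()) return false;
    final String[] groups = new String[m.groupCount()+1];
    for (int i = 0; i <= m.groupCount(); i++) groups[i] = m.group(i);
    action.accept(groups);
    return true;
  }
}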

Note that the spliterator provides all matcher groups, not just the full match. Also note that it can be used in a parallel stream: AbstractSpliterator supplies a trySplit implementation that permits limited parallelism, while the find() calls themselves still happen one at a time.

Typically you will use a convenience stream factory:

public static Stream<String[]> matcherStream(Matcher m) {
  return StreamSupport.stream(new MatcherSpliterator(m), false);
}

This gives you a powerful basis to concisely write all kinds of complex regex-oriented logic, for example:

import static java.util.stream.Collectors.joining;

private static final Pattern emailRegex = Pattern.compile("([^,]+?)@([^,]+)");
public static void main(String[] args) {
  final String emails = "[email protected], [email protected], [email protected]";
  System.out.println("User has e-mail accounts on these domains: " +
      matcherStream(emailRegex.matcher(emails))
      .map(gs->gs[2])
      .collect(joining(", ")));
}

This prints:

User has e-mail accounts on these domains: gmail.com, yahoo.com, tijuana.com

For completeness, your code can be rewritten as

Set<String> set = matcherStream(mtr).map(gs->gs[0].toLowerCase()).collect(toSet());

Marko Topolnik answered Nov 04 '22 22:11


Marko's answer demonstrates how to get matches into a stream using a Spliterator. Well done, give that man a big +1! Seriously, make sure you upvote his answer before you even consider upvoting this one, since this one is entirely derivative of his.

I have only a small bit to add to Marko's answer: instead of representing each match as an array of strings (with each array element representing a match group), it is better to represent it as a MatchResult, a type invented for this purpose. The result is then a Stream<MatchResult> instead of a Stream<String[]>. The code gets a little simpler, too. The body of tryAdvance would be

    if (m.find()) {
        action.accept(m.toMatchResult());
        return true;
    } else {
        return false;
    }

The map call in his email-matching example would change to

    .map(mr -> mr.group(2))

and the OP's example would be rewritten as

Set<String> set = matcherStream(mtr)
                      .map(mr -> mr.group(0).toLowerCase())
                      .collect(toSet());

Using MatchResult gives a bit more flexibility in that it also provides offsets of match groups within the string, which could be useful for certain applications.
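The fragments above can be assembled into a minimal, self-contained sketch, assuming Java 8. The class name MatchResultStreams and the helpers lowerCaseMatches and matchOffsets are hypothetical names introduced only for illustration, and the anonymous spliterator stands in for Marko's MatcherSpliterator.

```java
import java.util.List;
import java.util.Set;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Hypothetical class name; only the body of tryAdvance comes from the answer above.
class MatchResultStreams {

  // Wraps a Matcher in a sequential Stream<MatchResult>.
  static Stream<MatchResult> matcherStream(Matcher m) {
    Spliterator<MatchResult> sp = new Spliterators.AbstractSpliterator<MatchResult>(
        Long.MAX_VALUE, Spliterator.ORDERED | Spliterator.NONNULL | Spliterator.IMMUTABLE) {
      @Override
      public boolean tryAdvance(Consumer<? super MatchResult> action) {
        if (!m.find()) return false;
        // toMatchResult() takes a snapshot, so later find() calls do not alter it.
        action.accept(m.toMatchResult());
        return true;
      }
    };
    return StreamSupport.stream(sp, false);
  }

  // The question's use case: all matches, lower-cased, collected into a Set.
  static Set<String> lowerCaseMatches(String regex, String input) {
    return matcherStream(Pattern.compile(regex).matcher(input))
        .map(mr -> mr.group().toLowerCase())
        .collect(Collectors.toSet());
  }

  // Uses the offsets that MatchResult exposes: "start-end" for each full match.
  static List<String> matchOffsets(String regex, String input) {
    return matcherStream(Pattern.compile(regex).matcher(input))
        .map(mr -> mr.start() + "-" + mr.end())
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    System.out.println(lowerCaseMatches("[A-Za-z]+", "Foo FOO bar"));
    System.out.println(matchOffsets("[a-z]+", "ab cd"));
  }
}
```

The matchOffsets helper is the kind of thing that is awkward with String[] elements, since the array carries the group text but not where each group was found.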

Stuart Marks answered Nov 04 '22 22:11


I don't think you can turn this into a Stream without writing your own Spliterator, but I don't know why you would want to.
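For readers on Java 9 or later (an assumption, since the question targets Java 8), the JDK itself provides Matcher.results(), which returns a Stream<MatchResult>, so no custom Spliterator is needed there. A minimal sketch, with hypothetical class and method names:

```java
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical class name; requires Java 9+ for Matcher.results().
class ResultsStreamMatches {

  static Set<String> lowerCaseMatches(String regex, String input) {
    return Pattern.compile(regex).matcher(input)
        .results()                              // Stream<MatchResult>, one element per match
        .map(mr -> mr.group().toLowerCase())
        .collect(Collectors.toSet());
  }

  public static void main(String[] args) {
    System.out.println(lowerCaseMatches("[A-Za-z]+", "Foo FOO bar"));
  }
}
```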

Matcher.find() is a state-changing operation on the Matcher object, so running each find() in a parallel stream would produce inconsistent results. Running the stream serially wouldn't perform better than the Java 7 equivalent and would be harder to understand.
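For comparison, here is a sketch of the plain-loop version this answer has in mind, written without the double-brace initializer from the question. The class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name; works on Java 7 and later.
class PlainLoopMatches {

  static Set<String> lowerCaseMatches(String regex, String input) {
    Matcher mtr = Pattern.compile(regex).matcher(input);
    Set<String> set = new HashSet<>();
    // Each find() advances the matcher to the next match; group() returns its text.
    while (mtr.find()) {
      set.add(mtr.group().toLowerCase());
    }
    return set;
  }

  public static void main(String[] args) {
    System.out.println(lowerCaseMatches("[A-Za-z]+", "Foo FOO bar"));
  }
}
```

Avoiding the double-brace form also avoids creating an anonymous HashSet subclass just to populate the set.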

dkatzel answered Nov 04 '22 20:11