I am trying to parse standard input and extract every string that matches with a specific pattern, count the number of occurrences of each match, and print the results alphabetically. This problem seems like a good match for the Streams API, but I can't find a concise way to create a stream of matches from a Matcher.
I worked around this problem by implementing an iterator over the matches and wrapping it into a Stream, but the result is not very readable. How can I create a stream of regex matches without introducing additional classes?
public class PatternCounter
{
static private class MatcherIterator implements Iterator<String> {
private final Matcher matcher;
public MatcherIterator(Matcher matcher) {
this.matcher = matcher;
}
public boolean hasNext() {
return matcher.find();
}
public String next() {
return matcher.group(0);
}
}
static public void main(String[] args) throws Throwable {
Pattern pattern = Pattern.compile("[a-zA-Z0-9.!#$%&’*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\\.[a-zA-Z0-9-]+)");
new TreeMap<String, Long>(new BufferedReader(new InputStreamReader(System.in))
.lines().map(line -> {
Matcher matcher = pattern.matcher(line);
return StreamSupport.stream(
Spliterators.spliteratorUnknownSize(new MatcherIterator(matcher), Spliterator.ORDERED), false);
}).reduce(Stream.empty(), Stream::concat).collect(groupingBy(o -> o, counting()))
).forEach((k, v) -> {
System.out.printf("%s\t%s\n",k,v);
});
}
}
To match a character having special meaning in regex, you need to use a escape sequence prefix with a backslash ( \ ). E.g., \. matches "." ; regex \+ matches "+" ; and regex \( matches "(" . You also need to use regex \\ to match "\" (back-slash).
A regex pattern matches a target string. The pattern is composed of a sequence of atoms. An atom is a single point within the regex pattern which it tries to match to the target string. The simplest atom is a literal, but grouping parts of the pattern to match an atom will require using ( ) as metacharacters.
Matching a Single Character Using Regex ' dot character in a regular expression matches a single character without regard to what character it is. The matched character can be an alphabet, a number or, any special character.
Regular Expression support is common in many development tools and applications. Although .NET supports regular expression string search via the RegEx class, it has no support for byte Streams. We developed a Stream Regular Expression search class as part of a larger effort to scan incoming email received by a POP3 configured BizTalk Receive Port.
A regex usually comes within this form / abc /, where the search pattern is delimited by two slash characters /. At the end we can specify a flag with these values (we can also combine them each other): g (global) does not return after the first match, restarting the subsequent searches from the end of the previous match.
Notice that you can match also non-printable characters like tabs , new-lines , carriage returns . We are learning how to construct a regex but forgetting a fundamental concept: flags. A regex usually comes within this form / abc /, where the search pattern is delimited by two slash characters /.
Using $matches like we did in the previous posts means we have to write a lot of looping and if statements. With [regex]::matches() we can condense all that and it could work on a big blob of text instead of just a list of individual lines. This means that if there is more than 1 match per line we can still get it!
Well, in Java 8, there is Pattern.splitAsStream
which will provide a stream of items split by a delimiter pattern but unfortunately no support method for getting a stream of matches.
If you are going to implement such a Stream
, I recommend implementing Spliterator
directly rather than implementing and wrapping an Iterator
. You may be more familiar with Iterator
but implementing a simple Spliterator
is straight-forward:
final class MatchItr extends Spliterators.AbstractSpliterator<String> {
private final Matcher matcher;
MatchItr(Matcher m) {
super(m.regionEnd()-m.regionStart(), ORDERED|NONNULL);
matcher=m;
}
public boolean tryAdvance(Consumer<? super String> action) {
if(!matcher.find()) return false;
action.accept(matcher.group());
return true;
}
}
You may consider overriding forEachRemaining
with a straight-forward loop, though.
If I understand your attempt correctly, the solution should look more like:
Pattern pattern = Pattern.compile(
"[a-zA-Z0-9.!#$%&’*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\\.[a-zA-Z0-9-]+)");
try(BufferedReader br=new BufferedReader(System.console().reader())) {
br.lines()
.flatMap(line -> StreamSupport.stream(new MatchItr(pattern.matcher(line)), false))
.collect(Collectors.groupingBy(o->o, TreeMap::new, Collectors.counting()))
.forEach((k, v) -> System.out.printf("%s\t%s\n",k,v));
}
Java 9 provides a method Stream<MatchResult> results()
directly on the Matcher
. But for finding matches within a stream, there’s an even more convenient method on Scanner
. With that, the implementation simplifies to
try(Scanner s = new Scanner(System.console().reader())) {
s.findAll(pattern)
.collect(Collectors.groupingBy(MatchResult::group,TreeMap::new,Collectors.counting()))
.forEach((k, v) -> System.out.printf("%s\t%s\n",k,v));
}
This answer contains a back-port of Scanner.findAll
that can be used with Java 8.
Going off of Holger's solution, we can support arbitrary Matcher
operations (such as getting the nth group) by having the user provide a Function<Matcher, String>
operation. We can also hide the Spliterator
as an implementation detail, so that callers can just work with the Stream
directly. As a rule of thumb StreamSupport
should be used by library code, rather than users.
public class MatcherStream {
private MatcherStream() {}
public static Stream<String> find(Pattern pattern, CharSequence input) {
return findMatches(pattern, input).map(MatchResult::group);
}
public static Stream<MatchResult> findMatches(
Pattern pattern, CharSequence input) {
Matcher matcher = pattern.matcher(input);
Spliterator<MatchResult> spliterator = new Spliterators.AbstractSpliterator<MatchResult>(
Long.MAX_VALUE, Spliterator.ORDERED|Spliterator.NONNULL) {
@Override
public boolean tryAdvance(Consumer<? super MatchResult> action) {
if(!matcher.find()) return false;
action.accept(matcher.toMatchResult());
return true;
}};
return StreamSupport.stream(spliterator, false);
}
}
You can then use it like so:
MatcherStream.find(Pattern.compile("\\w+"), "foo bar baz").forEach(System.out::println);
Or for your specific task (borrowing again from Holger):
try(BufferedReader br = new BufferedReader(System.console().reader())) {
br.lines()
.flatMap(line -> MatcherStream.find(pattern, line))
.collect(Collectors.groupingBy(o->o, TreeMap::new, Collectors.counting()))
.forEach((k, v) -> System.out.printf("%s\t%s\n", k, v));
}
If you want to use a Scanner
together with regular expressions using the findWithinHorizon
method you could also convert a regular expression into a stream of strings.
Here we use a stream builder which is very convenient to use during a conventional while
loop.
Here is an example:
private Stream<String> extractRulesFrom(String text, Pattern pattern, int group) {
Stream.Builder<String> builder = Stream.builder();
try(Scanner scanner = new Scanner(text)) {
while (scanner.findWithinHorizon(pattern, 0) != null) {
builder.accept(scanner.match().group(group));
}
}
return builder.build();
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With