Split java.util.stream.Stream

I have a text file that contains URLs and emails. I need to extract all of them from the file. Each URL and email can appear more than once, but the result shouldn't contain duplicates. I can extract all URLs using the following code:

Files.lines(filePath)
    .map(urlPattern::matcher)
    .filter(Matcher::find)
    .map(Matcher::group)
    .distinct();

I can extract all emails using the following code:

Files.lines(filePath)
    .map(emailPattern::matcher)
    .filter(Matcher::find)
    .map(Matcher::group)
    .distinct();

Can I extract all URLs and emails while reading the stream returned by Files.lines(filePath) only once? Something like splitting the stream of lines into a stream of URLs and a stream of emails.

york.beta asked May 13 '15 10:05



2 Answers

You can use the partitioningBy collector, though it's still not a very elegant solution.

Map<Boolean, List<String>> map = Files.lines(filePath)
        .filter(str -> urlPattern.matcher(str).matches() ||
                       emailPattern.matcher(str).matches())
        .distinct()
        .collect(Collectors.partitioningBy(str -> urlPattern.matcher(str).matches()));
List<String> urls = map.get(true);
List<String> emails = map.get(false);
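To make the partitioningBy approach easy to try out, here is a minimal self-contained sketch that runs on an in-memory stream instead of a file. The class name PartitionDemo and both regular expressions are assumptions made for illustration (deliberately simplified, not production-grade URL or email patterns). Like the snippet above, it uses matches(), so it only works when each line consists of exactly one URL or one email.

```java
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class PartitionDemo {
    // Hypothetical, simplified patterns; real URL/email regexes are more involved.
    static final Pattern URL_PATTERN = Pattern.compile("https?://\\S+");
    static final Pattern EMAIL_PATTERN = Pattern.compile("[\\w.+-]+@[\\w-]+(\\.[\\w-]+)+");

    // Key true holds the URLs, key false holds the emails.
    // Lines matching neither pattern are filtered out before partitioning.
    static Map<Boolean, List<String>> partition(Stream<String> lines) {
        return lines
                .filter(s -> URL_PATTERN.matcher(s).matches()
                          || EMAIL_PATTERN.matcher(s).matches())
                .distinct()
                .collect(Collectors.partitioningBy(s -> URL_PATTERN.matcher(s).matches()));
    }

    public static void main(String[] args) {
        Map<Boolean, List<String>> map = partition(Stream.of(
                "http://example.com", "alice@example.com",
                "http://example.com", "not a match"));
        System.out.println("urls   = " + map.get(true));   // urls   = [http://example.com]
        System.out.println("emails = " + map.get(false));  // emails = [alice@example.com]
    }
}
```

With a real file, Files.lines(filePath) would be passed in place of Stream.of(...), ideally inside a try-with-resources block so the file handle is closed.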

If you don't want to apply the regexp twice, you can do it using an intermediate pair object (for example, SimpleEntry):

public static String classify(String str) {
    return urlPattern.matcher(str).matches() ? "url" : 
        emailPattern.matcher(str).matches() ? "email" : null;
}

Map<String, Set<String>> map = Files.lines(filePath)
        .map(str -> new AbstractMap.SimpleEntry<>(classify(str), str))
        .filter(e -> e.getKey() != null)
        .collect(Collectors.groupingBy(e -> e.getKey(),
            Collectors.mapping(e -> e.getValue(), Collectors.toSet())));

Using my free StreamEx library, the last step would be shorter:

Map<String, Set<String>> map = StreamEx.of(Files.lines(filePath))
        .mapToEntry(str -> classify(str), Function.identity())
        .nonNullKeys()
        .grouping(Collectors.toSet());
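For reference, the plain-JDK SimpleEntry/groupingBy variant (no StreamEx) can be put together as a runnable sketch like the one below. The class name ClassifyDemo, the sample input and the simplified patterns are assumptions for illustration only. One detail worth noting: Collectors.groupingBy rejects null keys with a NullPointerException, which is why the unclassified lines have to be filtered out before collecting.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ClassifyDemo {
    // Hypothetical, simplified patterns used only for this sketch.
    static final Pattern URL_PATTERN = Pattern.compile("https?://\\S+");
    static final Pattern EMAIL_PATTERN = Pattern.compile("[\\w.+-]+@[\\w-]+(\\.[\\w-]+)+");

    // Returns "url", "email", or null when the line is neither.
    static String classify(String s) {
        if (URL_PATTERN.matcher(s).matches()) return "url";
        if (EMAIL_PATTERN.matcher(s).matches()) return "email";
        return null;
    }

    static Map<String, Set<String>> group(Stream<String> lines) {
        return lines
                // Pair each line with its category so the category is computed only once.
                .map(s -> new SimpleEntry<>(classify(s), s))
                // groupingBy does not accept null keys, so drop unclassified lines here.
                .filter(e -> e.getKey() != null)
                .collect(Collectors.groupingBy(e -> e.getKey(),
                        Collectors.mapping(e -> e.getValue(), Collectors.toSet())));
    }

    public static void main(String[] args) {
        Map<String, Set<String>> map = group(Stream.of(
                "http://example.com", "alice@example.com",
                "http://example.com", "junk"));
        System.out.println("urls   = " + map.get("url"));
        System.out.println("emails = " + map.get("email"));
    }
}
```

Collecting into a Set removes duplicates, so no separate distinct() step is needed. If no line falls into a category, map.get(...) returns null for that key rather than an empty set.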
Tagir Valeev answered Oct 23 '22 03:10


You can perform the matching within the collect operation itself, using the three-argument collect(supplier, accumulator, combiner):

Map<String,Set<String>> map=Files.lines(filePath)
    .collect(HashMap::new,
        (hm,line)-> {
            Matcher m=emailPattern.matcher(line);
            if(m.matches())
              hm.computeIfAbsent("mail", x->new HashSet<>()).add(line);
            else if(m.usePattern(urlPattern).matches())
              hm.computeIfAbsent("url", x->new HashSet<>()).add(line);
        },
        (m1,m2)-> m2.forEach((k,v)->m1.merge(k, v,
                                     (s1,s2)->{s1.addAll(s2); return s1;}))
    );
Set<String> mail=map.get("mail"), url=map.get("url");

Note that this can easily be adapted to find multiple matches within a line:

Map<String,Set<String>> map=Files.lines(filePath)
    .collect(HashMap::new,
        (hm,line)-> {
            Matcher m=emailPattern.matcher(line);
            while(m.find())
              hm.computeIfAbsent("mail", x->new HashSet<>()).add(m.group());
            m.usePattern(urlPattern).reset();
            while(m.find())
              hm.computeIfAbsent("url", x->new HashSet<>()).add(m.group());
        },
        (m1,m2)-> m2.forEach((k,v)->m1.merge(k, v,
                                     (s1,s2)->{s1.addAll(s2); return s1;}))
    );
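As a quick way to experiment with the multiple-matches variant, the following sketch wraps it in a small class and feeds it an in-memory stream. The class name MultiMatchDemo, the sample lines and the simplified patterns are assumptions for illustration; the collect logic follows the snippet above. The reset() call after usePattern matters because usePattern keeps the matcher's current position in the input.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

public class MultiMatchDemo {
    // Hypothetical, simplified patterns used only for this sketch.
    static final Pattern URL_PATTERN = Pattern.compile("https?://\\S+");
    static final Pattern EMAIL_PATTERN = Pattern.compile("[\\w.+-]+@[\\w-]+(\\.[\\w-]+)+");

    static Map<String, Set<String>> extract(Stream<String> lines) {
        return lines.collect(HashMap::new,
                (hm, line) -> {
                    // First pass over the line: collect every email.
                    Matcher m = EMAIL_PATTERN.matcher(line);
                    while (m.find())
                        hm.computeIfAbsent("mail", x -> new HashSet<>()).add(m.group());
                    // Reuse the matcher for URLs; reset() rewinds it to the start of the line.
                    m.usePattern(URL_PATTERN).reset();
                    while (m.find())
                        hm.computeIfAbsent("url", x -> new HashSet<>()).add(m.group());
                },
                // Combiner, only used for parallel streams: merge the per-key sets.
                (m1, m2) -> m2.forEach((k, v) -> m1.merge(k, v,
                        (s1, s2) -> { s1.addAll(s2); return s1; })));
    }

    public static void main(String[] args) {
        Map<String, Set<String>> map = extract(Stream.of(
                "Contact alice@example.com or bob@example.org, see http://example.com/docs",
                "mail alice@example.com again"));
        System.out.println("mail = " + map.get("mail"));
        System.out.println("url  = " + map.get("url"));
    }
}
```

Because the values are HashSets, the printed order of the emails is not guaranteed; a TreeSet or LinkedHashSet could be substituted if a stable order is needed.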
Holger answered Oct 23 '22 04:10