Split java.util.stream.Stream

Tags: java, java-8, java-stream

I have a text file that contains URLs and emails. I need to extract all of them from the file. Each URL and email can appear more than once, but the result shouldn't contain duplicates. I can extract all URLs using the following code:

Files.lines(filePath)
    .map(urlPattern::matcher)
    .filter(Matcher::find)
    .map(Matcher::group)
    .distinct();

I can extract all emails using the following code:

Files.lines(filePath)
    .map(emailPattern::matcher)
    .filter(Matcher::find)
    .map(Matcher::group)
    .distinct();

Can I extract all URLs and emails while reading the stream returned by Files.lines(filePath) only once? Something like splitting the stream of lines into a stream of URLs and a stream of emails.


asked May 13 '15 10:05

york.beta

2 Answers

You can use the partitioningBy collector, though it's still not a very elegant solution.

Map<Boolean, List<String>> map = Files.lines(filePath)
        .filter(str -> urlPattern.matcher(str).matches() ||
                       emailPattern.matcher(str).matches())
        .distinct()
        .collect(Collectors.partitioningBy(str -> urlPattern.matcher(str).matches()));
List<String> urls = map.get(true);
List<String> emails = map.get(false);
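
For trying this out without a file, here is a minimal self-contained sketch of the same partitioning idea. The two patterns are deliberately simplified stand-ins (an assumption, not the asker's real patterns), and the names PartitionDemo and partition are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class PartitionDemo {
    // Simplified stand-in patterns (assumptions); real URL/email matching is more involved.
    static final Pattern URL = Pattern.compile("https?://\\S+");
    static final Pattern EMAIL = Pattern.compile("[\\w.+-]+@[\\w-]+(\\.[\\w-]+)+");

    // true  -> lines that are entirely a URL
    // false -> lines that are entirely an email
    static Map<Boolean, List<String>> partition(Stream<String> lines) {
        return lines
                // keep only lines that are a URL or an email
                .filter(s -> URL.matcher(s).matches() || EMAIL.matcher(s).matches())
                .distinct()
                .collect(Collectors.partitioningBy(s -> URL.matcher(s).matches()));
    }

    public static void main(String[] args) {
        Map<Boolean, List<String>> map = partition(Stream.of(
                "https://example.com", "alice@example.com",
                "not a match", "https://example.com"));
        System.out.println("urls   = " + map.get(true));
        System.out.println("emails = " + map.get(false));
    }
}
```

One property worth knowing: the map produced by partitioningBy contains both keys even when nothing matched, so map.get(true) and map.get(false) are empty lists rather than null.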

If you don't want to apply the regexp twice, you can use an intermediate pair object (for example, SimpleEntry):

public static String classify(String str) {
    return urlPattern.matcher(str).matches() ? "url" : 
        emailPattern.matcher(str).matches() ? "email" : null;
}

Map<String, Set<String>> map = Files.lines(filePath)
        .map(str -> new AbstractMap.SimpleEntry<>(classify(str), str))
        .filter(e -> e.getKey() != null)
        .collect(Collectors.groupingBy(e -> e.getKey(),
            Collectors.mapping(e -> e.getValue(), Collectors.toSet())));
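
The null-key filter in this pipeline is not optional: Collectors.groupingBy throws a NullPointerException when the classifier returns null. A self-contained sketch of the same pipeline follows; the simplified patterns and the names ClassifyDemo and group are assumptions made for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class ClassifyDemo {
    // Simplified stand-in patterns (assumptions).
    static final Pattern URL = Pattern.compile("https?://\\S+");
    static final Pattern EMAIL = Pattern.compile("[\\w.+-]+@[\\w-]+(\\.[\\w-]+)+");

    // Returns "url", "email", or null when the line is neither.
    static String classify(String s) {
        return URL.matcher(s).matches() ? "url"
                : EMAIL.matcher(s).matches() ? "email" : null;
    }

    static Map<String, Set<String>> group(Stream<String> lines) {
        return lines
                // classify each line once and carry the label along with the line
                .map(s -> new SimpleEntry<>(classify(s), s))
                // drop unclassified lines; groupingBy rejects null keys
                .filter(e -> e.getKey() != null)
                .collect(Collectors.groupingBy(e -> e.getKey(),
                        Collectors.mapping(e -> e.getValue(), Collectors.toSet())));
    }

    public static void main(String[] args) {
        Map<String, Set<String>> map = group(Stream.of(
                "https://example.com", "bob@example.org",
                "plain text", "bob@example.org"));
        System.out.println("urls   = " + map.get("url"));
        System.out.println("emails = " + map.get("email"));
    }
}
```

With groupingBy, a category that never occurs has no entry at all, so map.get("url") can return null and callers may prefer map.getOrDefault("url", Collections.emptySet()).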

Using my free StreamEx library, the last step would be shorter:

Map<String, Set<String>> map = StreamEx.of(Files.lines(filePath))
        .mapToEntry(str -> classify(str), Function.identity())
        .nonNullKeys()
        .grouping(Collectors.toSet());


answered Oct 23 '22 03:10

Tagir Valeev

You can perform the matching within a Collector:

Map<String,Set<String>> map=Files.lines(filePath)
    .collect(HashMap::new,
        (hm,line)-> {
            Matcher m=emailPattern.matcher(line);
            if(m.matches())
              hm.computeIfAbsent("mail", x->new HashSet<>()).add(line);
            else if(m.usePattern(urlPattern).matches())
              hm.computeIfAbsent("url", x->new HashSet<>()).add(line);
        },
        (m1,m2)-> m2.forEach((k,v)->m1.merge(k, v,
                                     (s1,s2)->{s1.addAll(s2); return s1;}))
    );
Set<String> mail=map.get("mail"), url=map.get("url");
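
Files.lines keeps the file open until the stream is closed, so in real code the pipeline is usually wrapped in try-with-resources. The sketch below shows that around the same three-argument collect. The simplified patterns, the class name CollectDemo, and both method names are assumptions; the temporary file exists only so the example has something to read.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

class CollectDemo {
    // Simplified stand-in patterns (assumptions).
    static final Pattern URL = Pattern.compile("https?://\\S+");
    static final Pattern EMAIL = Pattern.compile("[\\w.+-]+@[\\w-]+(\\.[\\w-]+)+");

    static Map<String, Set<String>> extract(Path file) throws IOException {
        // try-with-resources closes the underlying file when the stream is done
        try (Stream<String> lines = Files.lines(file)) {
            return lines.collect(HashMap::new,
                    (hm, line) -> {
                        Matcher m = EMAIL.matcher(line);
                        if (m.matches())
                            hm.computeIfAbsent("mail", x -> new HashSet<>()).add(line);
                        else if (m.usePattern(URL).matches())
                            hm.computeIfAbsent("url", x -> new HashSet<>()).add(line);
                    },
                    // combiner: only invoked for parallel streams
                    (m1, m2) -> m2.forEach((k, v) -> m1.merge(k, v,
                            (s1, s2) -> { s1.addAll(s2); return s1; })));
        }
    }

    // Demo helper: writes the lines to a temporary file, extracts, then deletes the file.
    static Map<String, Set<String>> extractFromTempFile(String... lines) {
        try {
            Path tmp = Files.createTempFile("demo", ".txt");
            try {
                Files.write(tmp, Arrays.asList(lines));
                return extract(tmp);
            } finally {
                Files.delete(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, Set<String>> map = extractFromTempFile(
                "alice@example.com", "https://example.com", "junk", "alice@example.com");
        System.out.println("mail = " + map.get("mail"));
        System.out.println("url  = " + map.get("url"));
    }
}
```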

Note that this can easily be adapted to find multiple matches within a line:

Map<String,Set<String>> map=Files.lines(filePath)
    .collect(HashMap::new,
        (hm,line)-> {
            Matcher m=emailPattern.matcher(line);
            while(m.find())
              hm.computeIfAbsent("mail", x->new HashSet<>()).add(m.group());
            m.usePattern(urlPattern).reset();
            while(m.find())
              hm.computeIfAbsent("url", x->new HashSet<>()).add(m.group());
        },
        (m1,m2)-> m2.forEach((k,v)->m1.merge(k, v,
                                     (s1,s2)->{s1.addAll(s2); return s1;}))
    );
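
To see how find() differs from the whole-line matches() version, here is a sketch run on in-memory lines where one line holds several items. The patterns are simplified stand-ins, the names MultiMatchDemo and extract are hypothetical, and TreeSet replaces HashSet purely so the printed order is predictable.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

class MultiMatchDemo {
    // Simplified stand-in patterns (assumptions).
    static final Pattern URL = Pattern.compile("https?://\\S+");
    static final Pattern EMAIL = Pattern.compile("[\\w.+-]+@[\\w-]+(\\.[\\w-]+)+");

    static Map<String, Set<String>> extract(Stream<String> lines) {
        return lines.collect(HashMap::new,
                (hm, line) -> {
                    Matcher m = EMAIL.matcher(line);
                    while (m.find())  // every email occurrence in the line
                        hm.computeIfAbsent("mail", x -> new TreeSet<>()).add(m.group());
                    // usePattern keeps the current position, so reset() rewinds to the start
                    m.usePattern(URL).reset();
                    while (m.find())  // every URL occurrence in the line
                        hm.computeIfAbsent("url", x -> new TreeSet<>()).add(m.group());
                },
                (m1, m2) -> m2.forEach((k, v) -> m1.merge(k, v,
                        (s1, s2) -> { s1.addAll(s2); return s1; })));
    }

    public static void main(String[] args) {
        Map<String, Set<String>> map = extract(Stream.of(
                "Contact a@example.com or b@example.com, see https://example.com/docs",
                "Again: b@example.com"));
        System.out.println(map.get("mail")); // [a@example.com, b@example.com]
        System.out.println(map.get("url"));  // [https://example.com/docs]
    }
}
```

The second line repeats b@example.com, and the set drops the duplicate, which covers the no-duplicates requirement without a separate distinct() step.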

answered Oct 23 '22 04:10

Holger

