 

Java 8 Stream filtering and grouping by the same expensive method call

I'm looking for a clean way to optimize some Stream processing.

I have something like this:

try (Stream<Path> stream = Files.list(targetDir)) {
    Map<String, List<Path>> targetDirFilteredAndMapped = stream.parallel()
        .filter(path -> sd.containsKey(md5(path)))
        .collect(Collectors.groupingBy(path -> md5(path)));
} catch (IOException ioe) { /* manage exception */ }

Since the md5 function is quite expensive, I was wondering if there's a way to invoke it only once per file.

Any suggestions?
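(The md5 helper isn't shown in the question; below is a rough sketch of what it might look like, assuming it hashes the file contents with MessageDigest and wraps checked exceptions so it can be called from the stream lambdas above.)

// Hypothetical helper matching the md5(path) calls above.
// Needs imports from java.io, java.math, java.nio.file and java.security.
static String md5(Path path) {
    try (InputStream in = Files.newInputStream(path);
         DigestInputStream dis = new DigestInputStream(in, MessageDigest.getInstance("MD5"))) {
        byte[] buffer = new byte[8192];
        while (dis.read(buffer) != -1) {
            // drain the stream; the digest is updated as bytes are read
        }
        return String.format("%032x", new BigInteger(1, dis.getMessageDigest().digest()));
    } catch (IOException e) {
        throw new UncheckedIOException(e);
    } catch (NoSuchAlgorithmException e) {
        throw new IllegalStateException(e);
    }
}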

asked Sep 29 '16 by Gibraltar

2 Answers

You can create some PathWrapper object that contains a Path instance and its corresponding md5(path).

public class PathWrapper
{
    private final Path path;
    private final String md5; // not sure if it's a String

    public PathWrapper(Path path) {
        this.path = path;
        this.md5 = md5(path); // the expensive hash is computed exactly once, at construction
    }

    public Path getPath() { return path; }
    public String getMD5() { return md5; }
}

Then map your stream to Stream<PathWrapper>:

try (Stream<Path> stream = Files.list(targetDir)) {
    Map<String, List<Path>> targetDirFilteredAndMapped =
        stream.parallel()
              .map(PathWrapper::new)
              .filter(wrapper -> sd.containsKey(wrapper.getMD5()))
              .collect(Collectors.groupingBy(PathWrapper::getMD5,
                                             Collectors.mapping(PathWrapper::getPath,
                                                                Collectors.toList())));
} catch (IOException ioe) { /* manage exception */ }
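
As an aside (not from the original answer), if you'd rather not write a dedicated wrapper class, a plain Map.Entry can serve the same purpose on Java 8, for example:

// Assumes java.util.AbstractMap and the same md5 helper as in the question
try (Stream<Path> stream = Files.list(targetDir)) {
    Map<String, List<Path>> targetDirFilteredAndMapped =
        stream.parallel()
              .map(path -> new AbstractMap.SimpleImmutableEntry<>(md5(path), path))
              .filter(entry -> sd.containsKey(entry.getKey()))
              .collect(Collectors.groupingBy(Map.Entry::getKey,
                                             Collectors.mapping(Map.Entry::getValue,
                                                                Collectors.toList())));
} catch (IOException ioe) { /* manage exception */ }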
answered by Eran

If the md5 operation is truly dominating the performance, you may consider leaving off the filtering here and just removing the non-matching groups afterwards:

try (Stream<Path> stream = Files.list(targetDir)) {
    Map<String, List<Path>> targetDirFilteredAndMapped = stream.parallel()
        .collect(Collectors.groupingBy(p -> md5(p), HashMap::new, Collectors.toList()));
    targetDirFilteredAndMapped.keySet().retainAll(sd.keySet());
} catch (IOException ioe) {
    // manage exception
}

This, of course, temporarily requires more memory. If that is a concern, using a more complicated solution, as shown in the other answer, is unavoidable.
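
A further variation (not part of this answer): since the stream is already parallel, Collectors.groupingByConcurrent collects into a single ConcurrentHashMap instead of merging per-thread maps, and its key set still supports retainAll; whether that actually helps depends on the workload:

// Assumes java.util.concurrent.ConcurrentMap and the same md5 helper
try (Stream<Path> stream = Files.list(targetDir)) {
    ConcurrentMap<String, List<Path>> grouped = stream.parallel()
        .collect(Collectors.groupingByConcurrent(p -> md5(p)));
    grouped.keySet().retainAll(sd.keySet());
} catch (IOException ioe) {
    // manage exception
}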

answered by Holger