I'm looking for a way to optimize Stream processing in a clean way. I have something like this:
try (Stream<Path> stream = Files.list(targetDir)) {
    Map<String, List<Path>> targetDirFilteredAndMapped = stream.parallel()
            .filter(path -> sd.containsKey(md5(path)))
            .collect(Collectors.groupingBy(path -> md5(path)));
} catch (IOException ioe) {
    // manage exception
}
Since the md5 function is quite expensive, I was wondering if there's a way to invoke it only once per file.
Any suggestions?
You can create a PathWrapper object that contains a Path instance and its corresponding md5(path) value.
public class PathWrapper {
    private final Path path;
    private final String md5; // not sure if it's a String

    public PathWrapper(Path path) {
        this.path = path;
        this.md5 = md5(path);
    }

    public Path getPath() { return path; }
    public String getMD5() { return md5; }
}
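On Java 16+, a record can replace the wrapper class with less boilerplate. A minimal sketch; note that the original post never shows how md5 is implemented, so the md5Of helper below (using java.security.MessageDigest and Java 17's HexFormat) is an assumption, not the asker's actual function:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.HexFormat;

// Record version of PathWrapper: the hash is computed exactly once,
// in the secondary constructor, and stored alongside the path.
record HashedPath(Path path, String md5) {
    HashedPath(Path path) {
        this(path, md5Of(path));
    }

    // Hypothetical md5 helper; the original post does not show its implementation.
    static String md5Of(Path path) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                                         .digest(Files.readAllBytes(path));
            return HexFormat.of().formatHex(digest);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

The record's accessors are path() and md5(), so the collect step would use HashedPath::md5 and HashedPath::path instead of the getters.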
Then map your stream to Stream<PathWrapper>:
try (Stream<Path> stream = Files.list(targetDir)) {
    Map<String, List<Path>> targetDirFilteredAndMapped =
            stream.parallel()
                  .map(PathWrapper::new)
                  .filter(wrapper -> sd.containsKey(wrapper.getMD5()))
                  .collect(Collectors.groupingBy(PathWrapper::getMD5,
                          Collectors.mapping(PathWrapper::getPath,
                                             Collectors.toList())));
} catch (IOException ioe) { /* manage exception */ }
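If you would rather not define a wrapper class at all, a Map.Entry (Java 9+) can carry the path and its hash through the pipeline instead. A sketch under the same assumptions; the md5 method below is a hypothetical stand-in, since the original post does not show its implementation:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.HexFormat;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class Md5Grouping {
    // Groups files by MD5 without a dedicated wrapper class: Map.entry pairs
    // each path with its hash, so md5 runs exactly once per file.
    static Map<String, List<Path>> group(Path targetDir, Map<String, ?> sd) throws IOException {
        try (Stream<Path> stream = Files.list(targetDir)) {
            return stream.parallel()
                    .map(p -> Map.entry(md5(p), p))           // hash computed once here
                    .filter(e -> sd.containsKey(e.getKey()))
                    .collect(Collectors.groupingBy(Map.Entry::getKey,
                            Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
        }
    }

    // Hypothetical md5 stand-in; the original post does not show its implementation.
    static String md5(Path p) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                                         .digest(Files.readAllBytes(p));
            return HexFormat.of().formatHex(digest);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

One caveat: Map.entry rejects null keys and values, which is fine here since both the hash and the path are always non-null.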
If the md5 operation truly dominates the performance, you may consider leaving out the filtering here and simply removing the non-matching groups afterwards:
try (Stream<Path> stream = Files.list(targetDir)) {
    Map<String, List<Path>> targetDirFilteredAndMapped = stream.parallel()
            .collect(Collectors.groupingBy(p -> md5(p), HashMap::new, Collectors.toList()));
    targetDirFilteredAndMapped.keySet().retainAll(sd.keySet());
} catch (IOException ioe) {
    // manage exception
}
This, of course, temporarily requires more memory. If that is a concern, a more complicated solution, like the ones shown in the other answers, is unavoidable.