
How to read in all Files in a sub-directory using Files.walk exactly once?

I am attempting to read in all of the files in all subdirectories of a directory. I have the logic written, but I am doing something slightly wrong because it is reading in each file twice.

To test my implementation, I created a directory containing three subdirectories, each holding 10 documents. That should be 30 documents in total.

Here is my code for testing that I am reading in the documents correctly:

String basePath = "src/test/resources/20NG";
Driver driver = new Driver();
List<Document> documents = driver.readInCorpus(basePath);
assertEquals(3 * 10, documents.size());

Where Driver#readInCorpus has the following code:

public List<Document> readInCorpus(String directory)
{
    try (Stream<Path> paths = Files.walk(Paths.get(directory)))
    {
        return paths
                .filter(Files::isDirectory)
                .map(this::readAllDocumentsInDirectory)
                .flatMap(Collection::stream)
                .collect(Collectors.toList());
    }
    catch (IOException e)
    {
        e.printStackTrace();
    }
    return Collections.emptyList();
}

private List<Document> readAllDocumentsInDirectory(Path path)
{
    try (Stream<Path> paths = Files.walk(path))
    {
        return paths
                .filter(Files::isRegularFile)
                .map(this::readInDocumentFromFile)
                .collect(Collectors.toList());
    }
    catch (IOException e)
    {
        e.printStackTrace();
    }
    return Collections.emptyList();
}

private Document readInDocumentFromFile(Path path)
{
    String fileName = path.getFileName().toString();
    String outputClass = path.getParent().getFileName().toString();
    List<String> words = EmailProcessor.readEmail(path);
    return new Document(fileName, outputClass, words);
}

When I run the test case, I see that the assertEquals failed because there were 60 documents retrieved, not 30, which is incorrect. When I debugged, the documents were all inserted into the list once, and then inserted again in the exact same order.

Where in my code am I reading in the documents twice?

Cache Staheli asked Dec 23 '22 17:12


2 Answers

The problem is in how you use Files.walk(path). It traverses the file system like a tree, starting at the path you give it and including that path itself. For example, say you have 3 folders: /parent and its 2 children, /parent/first and /parent/second. Files.walk(Paths.get("/parent")) with a directory filter gives you three Paths, one for the parent and one for each child, and this is what happens in your readInCorpus method.

Then, for each of those Paths, you call the second method, readAllDocumentsInDirectory, and the same story repeats: it traverses the folder like a tree.

Called with /parent, readAllDocumentsInDirectory returns all documents from both child folders, /parent/first and /parent/second. Then there are 2 more calls of readAllDocumentsInDirectory, one for /parent/first and one for /parent/second, and each returns the documents of its own folder a second time.
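
The doubling described above can be reproduced without the Document classes. This is only a minimal sketch: the class name WalkTwiceDemo, its helper methods, and the temporary directory layout are made up for illustration and are not part of the question's code. It counts files instead of reading them.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

class WalkTwiceDemo {

    // Hypothetical fixture: a temp root with subDirs subdirectories, each holding filesPerDir empty files.
    static Path createTree(int subDirs, int filesPerDir) {
        try {
            Path root = Files.createTempDirectory("walk-demo");
            for (int d = 0; d < subDirs; d++) {
                Path dir = Files.createDirectory(root.resolve("dir" + d));
                for (int f = 0; f < filesPerDir; f++) {
                    Files.createFile(dir.resolve("doc" + f + ".txt"));
                }
            }
            return root;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // One recursive walk: every regular file below start is seen once.
    static long countFiles(Path start) {
        try (Stream<Path> paths = Files.walk(start)) {
            return paths.filter(Files::isRegularFile).count();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Same shape as the question's code: walk to find directories (root included),
    // then walk each of those directories recursively again.
    static long countFilesNested(Path start) {
        try (Stream<Path> paths = Files.walk(start)) {
            return paths.filter(Files::isDirectory)
                        .mapToLong(WalkTwiceDemo::countFiles)
                        .sum();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path root = createTree(3, 10);
        System.out.println("single walk: " + countFiles(root));       // 30
        System.out.println("nested walk: " + countFilesNested(root)); // 60 = 30 (root) + 3 * 10
    }
}
```

The nested count is 60 because the root directory contributes all 30 files on its own, and the three subdirectories then contribute 10 each.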

That's why your documents are doubled. To fix it, call readAllDocumentsInDirectory only once, with Paths.get(basePath) as the argument, and remove the readInCorpus method.
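
If you would rather keep the two-step shape (find directories, then read each one), an alternative that this answer does not propose is to make the inner step non-recursive with Files.list, which returns only the direct entries of a directory. A rough sketch, assuming hypothetical names (ListPerDirectoryDemo, filesDirectlyIn, allFiles) and collecting Paths rather than Documents:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class ListPerDirectoryDemo {

    // Hypothetical fixture: a temp root with subDirs subdirectories of filesPerDir empty files each.
    static Path createTree(int subDirs, int filesPerDir) {
        try {
            Path root = Files.createTempDirectory("list-demo");
            for (int d = 0; d < subDirs; d++) {
                Path dir = Files.createDirectory(root.resolve("dir" + d));
                for (int f = 0; f < filesPerDir; f++) {
                    Files.createFile(dir.resolve("doc" + f + ".txt"));
                }
            }
            return root;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Files.list does not descend, so only files directly inside dir are returned.
    static List<Path> filesDirectlyIn(Path dir) {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.filter(Files::isRegularFile).collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // The outer walk finds every directory; each file belongs to exactly one of them.
    static List<Path> allFiles(Path root) {
        try (Stream<Path> dirs = Files.walk(root)) {
            return dirs.filter(Files::isDirectory)
                       .map(ListPerDirectoryDemo::filesDirectlyIn)
                       .flatMap(List::stream)
                       .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(allFiles(createTree(3, 10)).size()); // 30
    }
}
```

This is more code than a single walk, so it is probably only worth it when each directory needs its own handling.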

Orest answered Jan 26 '23 00:01


It looks like this comes from a misunderstanding of how Files.walk works. In Driver#readInCorpus, you have the following stream operation:

return paths
        .filter(Files::isDirectory)
        .map(this::readAllDocumentsInDirectory)
        .flatMap(Collection::stream)
        .collect(Collectors.toList());

Your mapping function (this::readAllDocumentsInDirectory) runs once for every directory in the Files.walk stream, which includes the top-level directory as well as the subdirectories, and each call walks its directory recursively.

This means that all of the files below the starting directory are read once through the top-level directory, and then read again when the stream reaches the subdirectories.

This isn't entirely clear from looking at the streams, but you should take a look at "How to debug stream().map(...) with lambda expressions?" to see how to better debug streams and avoid this problem in the future.
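
One simple way to see what a stream is actually handing to map is Stream.peek. The following is just a sketch with invented names (PeekDebugDemo, sampleTree, walkedDirectories); it logs every directory that passes the isDirectory filter, which makes it visible that the starting directory is in the stream too.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class PeekDebugDemo {

    // Hypothetical fixture: a temp root with two empty subdirectories.
    static Path sampleTree() {
        try {
            Path root = Files.createTempDirectory("peek-demo");
            Files.createDirectory(root.resolve("first"));
            Files.createDirectory(root.resolve("second"));
            return root;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Collects the directories Files.walk yields, printing each one as it flows through.
    static List<Path> walkedDirectories(Path root) {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isDirectory)
                    .peek(p -> System.out.println("directory in stream: " + p))
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Prints three lines: the root first, then its two subdirectories.
        walkedDirectories(sampleTree());
    }
}
```

Seeing the root directory in that output is the hint that a recursive read of each element will cover the subdirectories more than once.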

Because Files.walk already descends into every subdirectory, you can skip the intermediate step of calling Driver#readAllDocumentsInDirectory and just have this in Driver#readInCorpus:

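public List<Document> readInCorpus(String directory)
{
    // Files.walk is recursive, so one walk visits each file in every subdirectory once.
    try (Stream<Path> paths = Files.walk(Paths.get(directory)))
    {
        return paths
                .filter(Files::isRegularFile)   // skip the directories themselves
                .map(this::readInDocumentFromFile)
                .collect(Collectors.toList());
    }
    catch (IOException e)
    {
        e.printStackTrace();
    }
    // Reached only when Files.walk threw an IOException.
    return Collections.emptyList();
}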

Cache Staheli answered Jan 25 '23 22:01