I am attempting to read in all of the files in all subdirectories of a directory. I have the logic written, but I am doing something slightly wrong because it is reading in each file twice.
To test my implementation, I created a directory with three subdirectories in it each having 10 documents in them. That should be 30 documents in total.
Here is my code for testing that I am reading in the documents correctly:
String basePath = "src/test/resources/20NG";
Driver driver = new Driver();
List<Document> documents = driver.readInCorpus(basePath);
assertEquals(3 * 10, documents.size());
Where Driver#readInCorpus has the following code:
public List<Document> readInCorpus(String directory)
{
    try (Stream<Path> paths = Files.walk(Paths.get(directory)))
    {
        return paths
                .filter(Files::isDirectory)
                .map(this::readAllDocumentsInDirectory)
                .flatMap(Collection::stream)
                .collect(Collectors.toList());
    }
    catch (IOException e)
    {
        e.printStackTrace();
    }
    return Collections.emptyList();
}

private List<Document> readAllDocumentsInDirectory(Path path)
{
    try (Stream<Path> paths = Files.walk(path))
    {
        return paths
                .filter(Files::isRegularFile)
                .map(this::readInDocumentFromFile)
                .collect(Collectors.toList());
    }
    catch (IOException e)
    {
        e.printStackTrace();
    }
    return Collections.emptyList();
}

private Document readInDocumentFromFile(Path path)
{
    String fileName = path.getFileName().toString();
    String outputClass = path.getParent().getFileName().toString();
    List<String> words = EmailProcessor.readEmail(path);
    return new Document(fileName, outputClass, words);
}
When I run the test case, the assertEquals fails because 60 documents were retrieved instead of 30. When I debugged, I saw that all of the documents were inserted into the list once, and then inserted again in exactly the same order.
Where in my code am I reading in the documents twice?
The problem is in how you use Files.walk(path). It traverses the file system like a tree: it returns the starting path and everything below it, recursively.
For example, suppose you have 3 folders: /parent and its 2 children, /parent/first and /parent/second.
Files.walk(Paths.get("/parent")) will give you three Paths, one for each folder (the parent and its 2 children). This is what happens in your readInCorpus method.
Then, for each of those Paths, you call the second method, readAllDocumentsInDirectory, and the same thing happens there: it traverses the folders like a tree.
Called with /parent, readAllDocumentsInDirectory returns all documents from both child folders, /parent/first and /parent/second. Then there are 2 more calls of readAllDocumentsInDirectory, for /parent/first and /parent/second, and each of them returns the documents of its own folder again.
That's why your documents are doubled. To fix it, call readAllDocumentsInDirectory only once, with Paths.get(basePath) as the argument, and remove the readInCorpus method.
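The double counting described above can be reproduced on a small scale. The following is only a sketch: the class WalkDemo and its helpers (createSampleTree, regularFilesBelow, nestedWalk) are hypothetical names made up for illustration, and the files are empty placeholders in a temporary directory rather than your real corpus.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class WalkDemo {

    // Hypothetical helper: creates <temp>/first and <temp>/second,
    // each holding filesPerDir empty files. The temp directory is not cleaned up.
    public static Path createSampleTree(int filesPerDir) {
        try {
            Path root = Files.createTempDirectory("walk-demo");
            for (String name : new String[] {"first", "second"}) {
                Path dir = Files.createDirectory(root.resolve(name));
                for (int i = 0; i < filesPerDir; i++) {
                    Files.createFile(dir.resolve("doc" + i + ".txt"));
                }
            }
            return root;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // A single recursive walk: every regular file below start, each seen once.
    public static List<Path> regularFilesBelow(Path start) {
        try (Stream<Path> paths = Files.walk(start)) {
            return paths.filter(Files::isRegularFile).collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Mirrors the structure in the question: walk all directories (including
    // start itself), then run another recursive walk inside each of them.
    public static List<Path> nestedWalk(Path start) {
        try (Stream<Path> dirs = Files.walk(start)) {
            return dirs.filter(Files::isDirectory)
                    .flatMap(dir -> regularFilesBelow(dir).stream())
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path root = createSampleTree(3);
        System.out.println("single walk: " + regularFilesBelow(root).size()); // 6
        System.out.println("nested walk: " + nestedWalk(root).size());        // 12
    }
}
```

With 2 subfolders of 3 files each, the nested version sees every file twice: once through the top-level folder and once through the file's own folder.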
It looks like this comes from a misunderstanding of how Files.walk works. In Driver#readInCorpus, you have the following stream operation:
return paths
        .filter(Files::isDirectory)
        .map(this::readAllDocumentsInDirectory)
        .flatMap(Collection::stream)
        .collect(Collectors.toList());
Your mapping function (this::readAllDocumentsInDirectory) reads all of the documents below each directory in the Files.walk stream, and that stream includes the top-level directory as well as the subdirectories.
This means that all of the files below the starting directory are read once for the top-level directory, and then read again when the stream reaches the subdirectories.
This isn't obvious from looking at the streams. The question "How to debug stream().map(...) with lambda expressions?" shows how to debug streams better and avoid this kind of problem in the future.
Because Files.walk already recurses into the subdirectories, you can skip the intermediate step of calling Driver#readAllDocumentsInDirectory and just have this in Driver#readInCorpus:
public List<Document> readInCorpus(String directory)
{
    try (Stream<Path> paths = Files.walk(Paths.get(directory)))
    {
        return paths
                .filter(Files::isRegularFile)
                .map(this::readInDocumentFromFile)
                .collect(Collectors.toList());
    }
    catch (IOException e)
    {
        e.printStackTrace();
    }
    return Collections.emptyList();
}
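If you would rather keep one method per directory, another option might be to stop the inner walk from recursing by passing a maximum depth of 1 to Files.walk, so each directory only contributes the files that sit directly in it. This is just a sketch under that assumption, not the code from the question: DepthLimitedWalk, createSampleTree, filesDirectlyIn and allFiles are hypothetical names, and it collects Paths instead of Document objects to stay self-contained.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class DepthLimitedWalk {

    // Hypothetical helper: creates <temp>/first and <temp>/second,
    // each holding filesPerDir empty files. The temp directory is not cleaned up.
    public static Path createSampleTree(int filesPerDir) {
        try {
            Path root = Files.createTempDirectory("depth-demo");
            for (String name : new String[] {"first", "second"}) {
                Path dir = Files.createDirectory(root.resolve(name));
                for (int i = 0; i < filesPerDir; i++) {
                    Files.createFile(dir.resolve("doc" + i + ".txt"));
                }
            }
            return root;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // maxDepth 1 visits dir itself and its direct entries only, so files in
    // subdirectories are left for the call that handles that subdirectory.
    public static List<Path> filesDirectlyIn(Path dir) {
        try (Stream<Path> entries = Files.walk(dir, 1)) {
            return entries.filter(Files::isRegularFile).collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // The outer walk still recurses, so every directory is visited exactly once.
    public static List<Path> allFiles(Path start) {
        try (Stream<Path> dirs = Files.walk(start)) {
            return dirs.filter(Files::isDirectory)
                    .flatMap(dir -> filesDirectlyIn(dir).stream())
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path root = createSampleTree(3);
        System.out.println("files found: " + allFiles(root).size()); // 6
    }
}
```

The single recursive walk shown above is shorter; the depth-limited variant is mainly of interest if per-directory processing is needed for some other reason.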