Currently i am tasked with making a tool that can check whether a link is correct or not using java. The link is fed from Jericho HTML Parser, and my job is only to check whether the file is exist / the link is correct or not. That part is done, the hard part is to optimize it, since my code run (i have to say) rather sluggishly on 65ms per run
public static String checkRelativeURL(String originalFileLoc, String relativeLoc){
StringBuilder sb = new StringBuilder();
String absolute = Common.relativeToAbsolute(originalFileLoc, relativeLoc); //built in function to replace the link from relative link to absolute path
sb.append(absolute);
sb.append("\t");
try {
Path path = Paths.get(absolute);
sb.append(Files.exists(path));
}catch (InvalidPathException | NullPointerException ex) {
sb.append(false);
}
sb.append("\t");
return sb.toString();
}
and on this line it took 65 ms
Path path = Paths.get(absolute);
sb.append(Files.exists(path));
I have tried using
File file = new File(absolute);
sb.append(file.isFile());
It's still ran around 65~100ms.
So is there any other faster way to check whether a file exists or not other than this?
Since i am processing more than 70k html files and every milliseconds counts, thanks :(
EDIT:
I tried listing all the files into some List, and it doesn't really helps since it take more than 20mins just to list all the file....
The code that i use to list all the file
static public void listFiles2(String filepath){
Path path = Paths.get(filepath);
File file = null;
String pathString = new String();
try {
if(path.toFile().isDirectory()){
DirectoryStream<Path> stream = Files.newDirectoryStream(path);
for(Path entry : stream){
file = entry.toFile();
pathString = entry.toString();
if(file.isDirectory()){
listFiles2(pathString);
}
if (file.isFile()){
filesInProject.add(pathString);
System.out.println(pathString);
}
}
stream.close();
}
} catch (IOException e) {
e.printStackTrace();
}
}
If you know in advance the target OS set (usually it is the case), ultimately the fastest way will be to list so many files through a shell, by invoking a process e.g. using Runtime.exec.
On Windows you can do with
dir /s /b
On Linux
ls -R -1
You can check what is the OS and use appropriate command (error or resort to directory stream if not supported).
If you wish simplicity and don't need to report a progress, you can avoid dealing with the process IO and store the list to a temporary file e.g. ls -R -1 > /tmp/filelist.txt
. Alternatively, you can read from the process output directly. Read with a buffered stream, a reader or alike, with large enough buffer.
On SSD it will complete in a blink of an eye and on modern HDD in seconds (half million files is not a problem with this approach).
Once you have the list, you can approach it differently depending on maximum files count and memory requirements. If requirements are loose, e.g. desktop program, you can do with very simple code e.g. pre-loading the complete file list to a HashSet and check existence when needed. Shortening path by removing common root will require much less memory. You can also reduce memory by keeping only filename hash instead of full name (common root removal will probably reduce more).
Or you can optimize it further if you wish, the question just reduces now to a problem of checking existense of a string in a list of strings stored in memory or file, which has many well known optimal solutions.
Bellow is very loose, simplistic sample for Windows. It executes dir on HDD (not SSD) drive root with ~400K files, reads the list and benchmarks (well, kind of) time and memory for string set and md5 set approaches:
public static void main(String args[]) throws Exception {
final Runtime rt = Runtime.getRuntime();
System.out.println("mem " + (rt.totalMemory() - rt.freeMemory())
/ (1024 * 1024) + " Mb");
long time = System.currentTimeMillis();
// windows command: cd to t:\ and run recursive dir
Process p = rt.exec("cmd /c \"t: & dir /s /b > filelist.txt\"");
if (p.waitFor() != 0)
throw new Exception("command has failed");
System.out.println("done executing shell, took "
+ (System.currentTimeMillis() - time) + "ms");
System.out.println();
File f = new File("T:/filelist.txt");
// load into hash set
time = System.currentTimeMillis();
Set<String> fileNames = new HashSet<String>(500000);
try (BufferedReader reader = new BufferedReader(new InputStreamReader(
new FileInputStream(f), StandardCharsets.UTF_8),
50 * 1024 * 1024)) {
for (String line = reader.readLine(); line != null; line = reader
.readLine()) {
fileNames.add(line);
}
}
System.out.println(fileNames.size() + " file names loaded took "
+ (System.currentTimeMillis() - time) + "ms");
System.gc();
System.out.println("mem " + (rt.totalMemory() - rt.freeMemory())
/ (1024 * 1024) + " Mb");
time = System.currentTimeMillis();
// check files
for (int i = 0; i < 70_000; i++) {
StringBuilder fileToCheck = new StringBuilder();
while (fileToCheck.length() < 256)
fileToCheck.append(Double.toString(Math.random()));
if (fileNames.contains(fileToCheck))
System.out.println("to prevent optimization, never executes");
}
System.out.println();
System.out.println("hash set 70K checks took "
+ (System.currentTimeMillis() - time) + "ms");
System.gc();
System.out.println("mem " + (rt.totalMemory() - rt.freeMemory())
/ (1024 * 1024) + " Mb");
// Test memory/performance with MD5 hash set approach instead of full
// names
time = System.currentTimeMillis();
Set<String> nameHashes = new HashSet<String>(50000);
MessageDigest md5 = MessageDigest.getInstance("MD5");
for (String name : fileNames) {
String nameMd5 = new String(md5.digest(name
.getBytes(StandardCharsets.UTF_8)), StandardCharsets.UTF_8);
nameHashes.add(nameMd5);
}
System.out.println();
System.out.println(fileNames.size() + " md5 hashes created, took "
+ (System.currentTimeMillis() - time) + "ms");
fileNames.clear();
fileNames = null;
System.gc();
Thread.sleep(100);
System.gc();
System.out.println("mem " + (rt.totalMemory() - rt.freeMemory())
/ (1024 * 1024) + " Mb");
time = System.currentTimeMillis();
// check files
for (int i = 0; i < 70_000; i++) {
StringBuilder fileToCheck = new StringBuilder();
while (fileToCheck.length() < 256)
fileToCheck.append(Double.toString(Math.random()));
String md5ToCheck = new String(md5.digest(fileToCheck.toString()
.getBytes(StandardCharsets.UTF_8)), StandardCharsets.UTF_8);
if (nameHashes.contains(md5ToCheck))
System.out.println("to prevent optimization, never executes");
}
System.out.println("md5 hash set 70K checks took "
+ (System.currentTimeMillis() - time) + "ms");
System.gc();
System.out.println("mem " + (rt.totalMemory() - rt.freeMemory())
/ (1024 * 1024) + " Mb");
}
Output:
mem 3 Mb
done executing shell, took 5686ms
403108 file names loaded took 382ms
mem 117 Mb
hash set 70K checks took 283ms
mem 117 Mb
403108 md5 hashes created, took 486ms
mem 52 Mb
md5 hash set 70K checks took 366ms
mem 48 Mb
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With