I am comparing substrings in two large text files. The approach is very simple: tokenize each file into its own token container, then compare the two containers with two nested for loops. Performance is disastrous! Does anybody have advice or an idea on how to improve it?
for (int s = 0; s < txtA.TokenContainer.size(); s++) {
    String strTxtA = txtA.getSubStr(s);
    strLengthA = txtA.getNumToken(s);
    if (strLengthA >= dp.getMinStrLength()) {
        int tokenFileB = 1;
        for (int t = 0; t < txtB.TokenContainer.size(); t++) {
            String strTxtB = txtB.getSubStr(t);
            strLengthB = txtB.getNumToken(t);
            if (strTxtA.equalsIgnoreCase(strTxtB)) {
                try {
                    subStrTemp = new SubStrTemp(
                            txtA.ID, txtB.ID, tokenFileA, tokenFileB,
                            (tokenFileA + strLengthA - 1),
                            (tokenFileB + strLengthB - 1));
                    if (subStrContainer.contains(subStrTemp) == false) {
                        subStrContainer.addElement(subStrTemp);
                    }
                } catch (Exception ex) {
                    logger.error("error");
                }
            }
            tokenFileB += strLengthB;
        }
        tokenFileA += strLengthA;
    }
}
In general, my code reads two large strings with the Java tokenizer into containers A and B and then tries to compare substrings. The positions of substrings that exist in both strings are stored in a Vector. But performance is awful, and I don't really know how to solve this with a HashMap.
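Your main problem is that you go through all of txtB for each token in txtA, so the work grows with the product of the two container sizes.
You should store information about the tokens from txtA (in a HashMap, for instance) and then, in a second loop (not a nested one), compare the strings from txtB against the ones already in the map. That way each container is traversed only once.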
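The suggestion above could be sketched roughly as follows. This is only a minimal illustration, not your actual classes: `Segment`, `Match`, and `findMatches` are hypothetical stand-ins for the entries of `TokenContainer`, for `SubStrTemp`, and for your loop. It assumes token positions start at 1 in both texts and that lower-casing with `Locale.ROOT` is an acceptable replacement for `equalsIgnoreCase` (the two are close but not identical for every Unicode character).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class SubstringMatcher {

    /** Hypothetical stand-in for one container entry: its text and token count. */
    public static final class Segment {
        public final String text;
        public final int numTokens;

        public Segment(String text, int numTokens) {
            this.text = text;
            this.numTokens = numTokens;
        }
    }

    /** Hypothetical result type: first and last token position in both texts. */
    public static final class Match {
        public final int startA, endA, startB, endB;

        public Match(int startA, int endA, int startB, int endB) {
            this.startA = startA;
            this.endA = endA;
            this.startB = startB;
            this.endB = endB;
        }
    }

    public static List<Match> findMatches(List<Segment> a, List<Segment> b, int minLen) {
        // Pass 1: index text A. Key = lower-cased text, value = every
        // {start, end} position at which that text occurs in A.
        Map<String, List<int[]>> indexA = new HashMap<>();
        int posA = 1; // assumption: positions are 1-based
        for (Segment seg : a) {
            if (seg.numTokens >= minLen) {
                String key = seg.text.toLowerCase(Locale.ROOT);
                indexA.computeIfAbsent(key, k -> new ArrayList<>())
                      .add(new int[] {posA, posA + seg.numTokens - 1});
            }
            // Assumption: advance for every segment so positions count all tokens.
            posA += seg.numTokens;
        }

        // Pass 2: walk text B once and look each segment up in the index.
        List<Match> matches = new ArrayList<>();
        int posB = 1;
        for (Segment seg : b) {
            List<int[]> hits = indexA.get(seg.text.toLowerCase(Locale.ROOT));
            if (hits != null) {
                for (int[] hit : hits) {
                    matches.add(new Match(hit[0], hit[1], posB, posB + seg.numTokens - 1));
                }
            }
            posB += seg.numTokens;
        }
        return matches;
    }
}
```

Each pair of positions is produced at most once here, so the `contains` check on the Vector (itself a linear scan) should no longer be needed. Lookups in the map take roughly constant time on average, so the total cost is about the size of A plus the size of B plus the number of matches found.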