I am reading about 600 text files (about 400 MB in total), parsing each file individually, and adding all the terms to a map so I can find the frequency of each word across the 600 files.
My parser function includes the following steps (in order):
It takes about 8 minutes and 48 seconds on a dual-core 2.2 GHz machine with 2 GB of RAM. I would like advice on how to speed this process up. Should I expect it to be this slow? And, if possible, how can I find out (in NetBeans) which functions take the most time to execute?
Unique words found: 398752.
CODE:
File file = new File(dir);
String[] files = file.list();
for (int i = 0; i < files.length; i++) {
    BufferedReader br = new BufferedReader(
            new InputStreamReader(
                    new BufferedInputStream(
                            new FileInputStream(dir + files[i])), encoding));
    try {
        String line;
        while ((line = br.readLine()) != null) {
            parsedString = parseString(line); // parse the string
            m = stringToMap(parsedString, m);
        }
    } finally {
        br.close();
    }
}
EDIT: Check this:
![enter image description here][1]
I don't know what to conclude.
EDIT: 80% of the time is spent in this function:
public String[] parseString(String sentence) {
    // separators; ,:;'"\/<>()[]*~^ºª+&%$ etc..
    String[] parts = sentence.toLowerCase().split("[,\\s\\-:\\?\\!\\«\\»\\'\\´\\`\\\"\\.\\\\\\/()<>*º;+&ª%\\[\\]~^]");
    Map<String, String> o = new HashMap<String, String>(); // save the hyphened words, aaa-bbb like Map<aaa,bbb>
    Pattern pattern = Pattern.compile("(?<![A-Za-zÁÉÍÓÚÀÃÂÊÎÔÛáéíóúàãâêîôû-])[A-Za-zÁÉÍÓÚÀÃÂÊÎÔÛáéíóúàãâêîôû]+-[A-Za-zÁÉÍÓÚÀÃÂÊÎÔÛáéíóúàãâêîôû]+(?![A-Za-z-])");
    Matcher matcher = pattern.matcher(sentence);
    // Find all matches like this: ("aaa-bb or bbb-cc") and put it to map to later add this words to the original map and discount the single words "aaa-aa" like "aaa" and "aa"
    for (int i = 0; matcher.find(); i++) {
        String[] tempo = matcher.group().split("-");
        o.put(tempo[0], tempo[1]);
    }
    //System.out.println("words: " + o);
    ArrayList temp = new ArrayList();
    temp.addAll(Arrays.asList(parts));
    for (Map.Entry<String, String> entry : o.entrySet()) {
        String key = entry.getKey();
        String value = entry.getValue();
        temp.add(key + "-" + value);
        if (temp.indexOf(key) != -1) {
            temp.remove(temp.indexOf(key));
        }
        if (temp.indexOf(value) != -1) {
            temp.remove(temp.indexOf(value));
        }
    }
    String[] strArray = new String[temp.size()];
    temp.toArray(strArray);
    return strArray;
}
600 files, each file about 0.5MB
EDIT 3: The pattern is no longer compiled each time a line is read. The new images are:

Be sure to increase your heap size, if you haven't already, using -Xmx. For this app, the impact may be striking.
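If you want to confirm that a larger -Xmx value actually reached the JVM, a small check like the one below may help. It is only a sketch: the class name HeapCheck and the method maxHeapMegabytes are made up for illustration, and the figure the JVM reports can be somewhat lower than the value you requested.

```java
class HeapCheck {
    // Hypothetical helper: returns the maximum amount of memory the JVM
    // will attempt to use for the heap, converted to megabytes.
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }
}
```

Printing HeapCheck.maxHeapMegabytes() at startup, after launching with something like -Xmx1024m, shows whether the setting took effect.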
The parts of your code that are likely to have the largest performance impact are the ones that are executed the most - which are the parts you haven't shown.
Update after memory screenshot
Look at all those Pattern$6 objects in the screenshot. I think you're recompiling the pattern a lot - maybe for every line. That would take a lot of time.
Update 2 - after code added to question.
Yup - two patterns compiled on every line - the explicit one, and also the "-" in the split (much cheaper, of course). I wish they hadn't added split() to String without it taking a compiled pattern as an argument. I see some other things that could be improved, but nothing else like the big compile. Just compile the pattern once, outside this function, maybe as a static class member.
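A minimal sketch of what "compile once" could look like is below. The class and method names are hypothetical, and the separator character class is deliberately shortened for readability, so substitute your full expressions.

```java
import java.util.regex.Pattern;

class PrecompiledPatterns {
    // Compiled once when the class is loaded, then reused for every line.
    // Shortened separator set, for illustration only (assumption).
    private static final Pattern SEPARATORS = Pattern.compile("[,\\s:;?!.()]+");
    private static final Pattern HYPHEN = Pattern.compile("-");

    // Lower-cases the line and splits it on the precompiled separator pattern.
    static String[] splitWords(String line) {
        return SEPARATORS.split(line.toLowerCase());
    }

    // Splits a hyphenated word using the precompiled hyphen pattern.
    static String[] splitHyphenated(String word) {
        return HYPHEN.split(word);
    }
}
```

Pattern.split(CharSequence) does the same job as String.split(String), but on a Pattern object you already hold, so nothing is recompiled per line.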
Try to use a single regex with a group that matches each word that is within tags, so that one regex can be applied to your entire input and there is no separate "split" stage.
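As a rough sketch of the single-regex idea, the code below finds each word directly with Matcher.find() and counts it, with no intermediate array or list. It rests on several assumptions: the names are hypothetical, it ignores the "within tags" requirement, it treats any run of Unicode letters (optionally joined by hyphens) as one word, which is not identical to the rules in your patterns, and it uses Map.merge, which needs Java 8 or later.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SingleRegexCounter {
    // One token pattern, compiled once: a run of letters, optionally followed
    // by further runs of letters joined by hyphens. \p{L} is any Unicode letter.
    private static final Pattern WORD = Pattern.compile("\\p{L}+(?:-\\p{L}+)*");

    // Adds every word found in the line to the running frequency map.
    static void countWords(String line, Map<String, Integer> counts) {
        Matcher m = WORD.matcher(line.toLowerCase());
        while (m.find()) {
            counts.merge(m.group(), 1, Integer::sum);
        }
    }
}
```

Because a hyphenated word is matched as a single token, there is no need to add "aaa-bbb" back afterwards and remove "aaa" and "bbb" from a list.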
Otherwise your approach seems reasonable, although I don't understand what you mean by "get the String [] ..." - I thought you were using an ArrayList. In any event, try to minimize the creation of objects, for both construction cost and garbage collection cost.