How can I speed up my Java text file parser?

Question

I am reading about 600 text files (about 400MB in total), parsing each file individually, and adding all the terms to a map so I can know the frequency of each word across the 600 files.

My parser function performs the following steps, in order:

Find the text between two tags, which is the relevant text to read in each file.
Lowercase all the text.
Call string.split with multiple delimiters.
Create an ArrayList with hyphenated words like "aaa-aa", add them to the split result above, and remove the separate "aaa" and "aa" entries from the String []. (I did this because I wanted "-" to be a delimiter, but I also wanted "aaa-aa" to be one word only, and not "aaa" and "aa".)
Get the String [] and map it to a Map = new HashMap ... (word, frequency).
Print everything.

It takes about 8 minutes and 48 seconds on a dual-core 2.2GHz machine with 2GB of RAM. I would like advice on how to speed this process up. Should I expect it to be this slow? And, if possible, how can I find out (in NetBeans) which functions take the most time to execute?

unique words found: 398752.

CODE:

File file = new File(dir);
String[] files = file.list();

for (int i = 0; i < files.length; i++) {
    BufferedReader br = new BufferedReader(
        new InputStreamReader(
            new BufferedInputStream(
                new FileInputStream(dir + files[i])), encoding));
    try {
        String line;
        while ((line = br.readLine()) != null) {
            parsedString = parseString(line); // parse the string
            m = stringToMap(parsedString, m);
        }
    } finally {
        br.close();
    }
}

EDIT: Check this:

![enter image description here][1]

I don't know what to conclude.

EDIT: 80% of the time is spent in this function:

    // Requires java.util.* and java.util.regex.Matcher / java.util.regex.Pattern.
    public String[] parseString(String sentence) {
        // Lowercase once, so the split words and the hyphenated words share the same case.
        String lower = sentence.toLowerCase();

        // Separators: , whitespace - : ? ! « » ' ´ ` " . \ / ( ) < > * º ; + & ª % ~ ^
        String[] parts = lower.split("[,\\s\\-:?!«»'´`\".\\\\/()<>*º;+&ª%~^]");

        // Hyphenated words, stored as first part -> second part (aaa-bbb becomes aaa -> bbb).
        Map<String, String> o = new HashMap<String, String>();

        // Note: this pattern is compiled again on every call.
        Pattern pattern = Pattern.compile("(?<![A-Za-zÁÉÍÓÚÀÃÂÊÎÔÛáéíóúàãâêîôû-])[A-Za-zÁÉÍÓÚÀÃÂÊÎÔÛáéíóúàãâêîôû]+-[A-Za-zÁÉÍÓÚÀÃÂÊÎÔÛáéíóúàãâêîôû]+(?![A-Za-z-])");
        Matcher matcher = pattern.matcher(lower);

        // Find every match such as "aaa-bb" so it can be added back as one word,
        // with the separate "aaa" and "bb" entries removed from the split result.
        while (matcher.find()) {
            String[] tempo = matcher.group().split("-");
            o.put(tempo[0], tempo[1]);
        }

        List<String> temp = new ArrayList<String>(Arrays.asList(parts));

        for (Map.Entry<String, String> entry : o.entrySet()) {
            String key = entry.getKey();
            String value = entry.getValue();
            temp.add(key + "-" + value);
            // remove(Object) drops the first occurrence, if there is one.
            temp.remove(key);
            temp.remove(value);
        }

        return temp.toArray(new String[temp.size()]);
    }

600 files, each file about 0.5MB

EDIT 3: The pattern is no longer compiled each time a line is read. The new images are:

[image 1]

[image 2]

Ed Staub · Accepted Answer

Be sure to increase your heap size, if you haven't already, using -Xmx. For this app, the impact may be striking.
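To confirm that a -Xmx setting actually took effect, the JVM can report its own heap limits. This is a minimal sketch; the class name HeapCheck and the helper toMiB are illustrative and not part of the question's code.

```java
public class HeapCheck {
    /** Converts a byte count to whole mebibytes. */
    public static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory(): the most heap the JVM will attempt to use (governed by -Xmx).
        System.out.println("max heap:   " + toMiB(rt.maxMemory()) + " MiB");
        // totalMemory(): heap currently reserved by the JVM.
        System.out.println("total heap: " + toMiB(rt.totalMemory()) + " MiB");
        // freeMemory(): unused part of the currently reserved heap.
        System.out.println("free heap:  " + toMiB(rt.freeMemory()) + " MiB");
    }
}
```

Running it as, for example, java -Xmx1g HeapCheck should report a max heap close to the requested size (1g is only an example value).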

The parts of your code that are likely to have the largest performance impact are the ones that are executed the most - which are the parts you haven't shown.
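If a profiler is not at hand, one rough way to see which stage dominates is to accumulate wall-clock time per stage with System.nanoTime(). The sketch below assumes Java 8 or later (for the lambda); StageTimer and the sample line are hypothetical names, and the numbers it prints are only indicative because of JIT warm-up.

```java
public class StageTimer {
    private long totalNanos;
    private long calls;

    /** Runs the task and adds its wall-clock duration to the running total. */
    public void time(Runnable task) {
        long start = System.nanoTime();
        try {
            task.run();
        } finally {
            totalNanos += System.nanoTime() - start;
            calls++;
        }
    }

    public long calls() {
        return calls;
    }

    public double totalMillis() {
        return totalNanos / 1_000_000.0;
    }

    public static void main(String[] args) {
        StageTimer splitTimer = new StageTimer();
        String line = "some sample line, with a few words";
        // Time the same stage many times, as it would run once per input line.
        for (int i = 0; i < 100_000; i++) {
            splitTimer.time(() -> line.split("[,\\s]+"));
        }
        System.out.println("split: " + splitTimer.calls() + " calls, "
                + splitTimer.totalMillis() + " ms in total");
    }
}
```

Wrapping each stage (reading, parsing, map updates) in its own timer gives a coarse breakdown of where the time goes.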

Update after memory screenshot

Look at all those Pattern$6 objects in the screenshot. I think you're recompiling the pattern a lot - maybe for every line. That would take a lot of time.

Update 2 - after code added to question.

Yup - two patterns compiled on every line - the explicit one, and also the "-" in the split (much cheaper, of course). I wish they hadn't added split() to String without it taking a compiled pattern as an argument. I see some other things that could be improved, but nothing else like the big compile. Just compile the pattern once, outside this function, maybe as a static class member.
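As a sketch of what "compile once" can look like: the patterns below are held in static final fields and reused for every line, and Pattern.split replaces String.split. The class name and the two simplified patterns (ASCII letters, a short delimiter list) are assumptions for illustration, not the question's real expressions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PrecompiledPatterns {
    // Compiled once when the class is loaded, then reused for every line.
    // Simplified, hypothetical patterns; substitute the real ones.
    private static final Pattern DELIMITERS = Pattern.compile("[,\\s:;?!.()]+");
    private static final Pattern HYPHENATED = Pattern.compile("[a-z]+-[a-z]+");

    /** Splits a line into lowercase words using the precompiled delimiter pattern. */
    public static String[] splitWords(String line) {
        return DELIMITERS.split(line.toLowerCase());
    }

    /** Collects hyphenated words such as "aaa-aa" using the precompiled pattern. */
    public static List<String> hyphenatedWords(String line) {
        List<String> found = new ArrayList<String>();
        Matcher m = HYPHENATED.matcher(line.toLowerCase());
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(String.join("|", splitWords("Hello, world: foo")));
        System.out.println(hyphenatedWords("a well-known and old-fashioned idea"));
    }
}
```

Pattern objects are immutable and safe to share; only the Matcher is created per line.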

Bohemian · Answer

Try to use a single regex that has a group matching each word within the tags, so that one regex can be used for your entire input and there is no separate "split" stage.

Otherwise your approach seems reasonable, although I don't understand what you mean by "get the String [] ..." - I thought you were using an ArrayList. In any event, try to minimize the creation of objects, for both construction cost and garbage collection cost.
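A rough sketch of the single-regex idea: instead of splitting on delimiters and then repairing hyphenated words, match the words directly and count them as they are found, with no intermediate String [] or ArrayList. It assumes the relevant text between the tags has already been extracted and that Java 8 or later is available (for Map.merge). WordCounter and its pattern are illustrative: the pattern treats any run of letters joined by single hyphens as one word, which is broader than the question's two-part pattern.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordCounter {
    // One or more letters, optionally followed by "-letters" groups,
    // so "aaa-aa" is matched as a single token. \p{L} also covers accented letters.
    private static final Pattern WORD = Pattern.compile("\\p{L}+(?:-\\p{L}+)*");

    /** Adds every word found in the line to the shared frequency map. */
    public static void countWords(String line, Map<String, Integer> counts) {
        Matcher m = WORD.matcher(line.toLowerCase());
        while (m.find()) {
            counts.merge(m.group(), 1, Integer::sum);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        countWords("Aaa-aa, bbb aaa. BBB!", counts);
        System.out.println(counts);
    }
}
```

Because the hyphenated form is matched as a whole, there is nothing to remove afterwards, which also avoids the repeated indexOf scans over the list.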


Tags: java, performance, file, parsing
