Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

HashSet vs ArrayList CPU usage is high

I am having 104k string values , out of which 89k are unique. I would like to check a string is present in this list or not.

This is my class and its method which hold all these records.

public class TestClass {
    private static TestClass singletonObj = null;
    private List<String> stringList= null;

    public static synchronized TestClass getInstance() {
        if(singletonObj == null) {
            singletonObj = new TestClass();
        }
        return singletonObj;
    }


    public boolean isValidString(String token) {
        if(stringList == null) {
            init();
        }
        if(stringList != null && token != null && !token.isEmpty())
            return stringList.contains(token.toLowerCase());
        return false;
    }

    private init() {
     stringList = new ArrayList<String>();
     // put all 104k values in this data structure.
    }
}

My application tries to use this isValidString() method concurrently with around 20 requests per second. This works fine but when i tried to change the Data structure to HashSet, the CPU usage goes very high. As per my understanding Hashset should perform better[o(1)] than ArrayList[o(n)]. Can anyone explain me why is this happening ?

like image 996
Sthita Avatar asked Aug 06 '15 08:08

Sthita


1 Answers

I created a simple class to spawn 20 threads that access your dictionary checker every second as per the bottom of this post.

I cannot replicate your results - but this might be because of the input data that I have access to. I have used your TestClass implementation importing ~130,000 words from the English Open Word List (EOWL). No ongoing high CPU use is seen with either ArrayList or HashSet as the type of stringList.

My guess is that your problem is due to your input data. I tried adding my input dictionary twice to create duplication - obviously with the ArrayList this just makes the list twice as long, but with the HashSet, it means the code is throwing out duplicates. You note that about 1/5 of your input data is duplicate. With 1/2 duplicates in my tests, I do see a slight increased CPU for about 2 seconds and then it drops back to almost nothing once stringList is initialised.

This "blip" could be more prolonged if your input strings are more complex than the single words I'm using. So maybe that is your problem. Alternatively - maybe you have some other code that wraps this part that is hogging CPU.

N.B. I would caution you as others have in comments about your implementation of init. In my experiments, I saw that many threads could call the dictionary check before the dictionary had fully initialised, giving inconsistent results for the same test word. Why not call it from your constructor, if it is to be a singleton object?

Code listings

Your TestClass with some input data code:

import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Scanner;

public class TestClass {
    private static TestClass singletonObj = null;
    //private List<String> stringList = null;

    private HashSet<String> stringList = null;

    public static synchronized TestClass getInstance() {
        if (singletonObj == null) {
            singletonObj = new TestClass();
        }
        return singletonObj;
    }

    public boolean isValidString(String token) {
        if (stringList == null) {
            init();
        }
        if (stringList != null && token != null && !token.isEmpty())
            return stringList.contains(token.toLowerCase());
        return false;
    }

    private void init() {
        String dictDir = "C:\\Users\\Richard\\Documents\\EOWL_CSVs";
        File[] csvs = (new File(dictDir)).listFiles();
        stringList = new HashSet<String>();
        Scanner inFile = null;

        for (File f : csvs) {
            try {
                inFile = new Scanner(new FileReader(f));
            } catch (FileNotFoundException e) {
                e.printStackTrace();
                System.exit(1);
            }

            while (inFile.hasNext()) {
                stringList.add(inFile.next().toLowerCase()
                        .replaceAll("[^a-zA-Z ]", ""));
            }
            inFile.close();
        }

        System.out.println("Dictionary initialised with " + stringList.size()
                + " members");
    }
}

Thread to access it:

import java.io.FileNotFoundException;

public class DictChecker extends Thread {

    TestClass t = null;
    public static int classId = 0;
    String className = null;
    
    
    public void doWork()
    {
        String testString = "Baby";
        if (t.isValidString(testString))
        {
            System.out.println("Got a valid string " + testString + " in class " + className);
        }
        else
        {
            System.out.println(testString + " not in the dictionary");
        }
    }
    
    public void run()
    {
        while (true)
        {
            try {
                DictChecker.sleep(1000);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
            doWork();
        }
    }
    
    public DictChecker()
    {
        t = TestClass.getInstance();
        className = "dChecker" + classId;
        classId += 1;
        System.out.println("Initialised " + className + " in thread " + this.getName());
    }
    
    public static void main(String[] args) throws FileNotFoundException
    {
        for (int i = 0; i < 20; i++)
        {
             (new DictChecker()).start();
             try {
                DictChecker.sleep(50);//simply to distribute load over the second
            } catch (InterruptedException e) {
                e.printStackTrace();
            } 
        }
    }
}
like image 127
J Richard Snape Avatar answered Sep 28 '22 17:09

J Richard Snape