
Search multiple HashMaps at the same time

tldr: How can I search for an entry in multiple (read-only) Java HashMaps at the same time?


The long version:

I have several dictionaries of various sizes stored as HashMap<String, String>. Once they are read in, they are never changed (strictly read-only). I want to check whether any dictionary contains an entry for my key and, if so, which one.

My code was originally looking for a key like this:

public DictionaryEntry getEntry(String key) {
    for (int i = 0; i < _numDictionaries; i++) {
        HashMap<String, String> map = getDictionary(i);
        if (map.containsKey(key))
             return new DictionaryEntry(map.get(key), i);
    }
    return null;
}

Then it got a little more complicated: my search string could contain typos, or be a variant of the stored entry. For example, if the stored key is "banana", I might look up "bannana" or "a banana" but would still like the entry for "banana" returned. Using the Levenshtein distance, I now loop through all dictionaries and each entry in them:

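public DictionaryEntry getEntry(String key) {
    DictionaryEntry closest = null;
    int closestDist = _maxLSDistance; // only accept matches below the threshold
    for (int i = 0; i < _numDictionaries; i++) {
        HashMap<String, String> map = getDictionary(i);
        for (Map.Entry<String, String> entry : map.entrySet()) {
            // Calculate Levenshtein distance, store closest match
            int dist = StringUtils.getLevenshteinDistance(entry.getKey(), key);
            if (dist < closestDist) {
                closest = new DictionaryEntry(entry.getValue(), i);
                closestDist = dist;
            }
        }
    }
    // closest match, or null if nothing is below the threshold
    return closest;
}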

So far everything works as it should and I'm getting the entry I want. Unfortunately I have to look up around 7000 strings, in five dictionaries of various sizes (~ 30 - 70k entries) and it takes a while. From my processing output I have the strong impression my lookup dominates overall runtime.

My first idea to improve runtime was to search all dictionaries in parallel. Since none of the dictionaries is ever changed and no more than one thread accesses a dictionary at the same time, I don't see any safety concerns.

The question is just: how do I do this? I have never used multithreading before. My search only came up with ConcurrentHashMap (which, to my understanding, I don't need) and the Runnable interface, where I'd have to put my processing into the run() method. I think I could rewrite my current class to implement Runnable, but I was wondering whether there is a simpler way to do this. Alternatively, how can I do it simply with Runnable? With my limited understanding, it currently looks like I would have to restructure a lot.


Since I was asked to share the Levenshtein logic: it's really nothing fancy, but here you go:

// StringUtils comes from Apache Commons Lang; _dictionary is a HashMap<String, String>
private int _maxLSDistance = 10;

public Map.Entry<String, String> getClosestMatch(String key) {
    Map.Entry<String, String> closestMatch = null;
    int lsDist = _maxLSDistance; // smallest distance found so far, always below the threshold

    if (key == null) {
        return null;
    }

    for (Map.Entry<String, String> entry : _dictionary.entrySet()) {
        // Perfect match
        if (entry.getKey().equals(key)) {
            return entry;
        }
        // Similar match
        else {
            int dist = StringUtils.getLevenshteinDistance(entry.getKey(), key);

            // Keep the entry if "dist" is below the threshold
            // and smaller than the distance of the already stored entry
            if (dist < lsDist) {
                closestMatch = entry;
                lsDist = dist;
            }
        }
    }
    return closestMatch;
}
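
For reference, the distances in the "banana" example can be checked with a plain edit-distance implementation. This is only a minimal sketch: the class name LevenshteinSketch and its distance method are made up for illustration, and the code in the question relies on StringUtils.getLevenshteinDistance instead.

```java
// Illustrative stand-in for StringUtils.getLevenshteinDistance (hypothetical class name).
final class LevenshteinSketch {

    /** Classic dynamic-programming edit distance, keeping only two rows of the table. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // building the first j chars of b from "" takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // reducing the first i chars of a to "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("banana", "bannana"));  // one extra 'n'
        System.out.println(distance("banana", "a banana")); // extra 'a' and space
    }
}
```

Each call costs time proportional to the product of the two string lengths, and the loop in the question makes one such call per dictionary entry, which is consistent with the lookup dominating the runtime.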
Asked Jul 30 '15 at 11:07 by fukiburi


1 Answer

To use multithreading in your case, you could do something like this:

The "monitor" class, which basically stores the results and coordinates the threads:

import java.util.ArrayList;

public class Results {

    // number of dictionaries whose search has not finished yet
    private int nrOfDictionaries = 4;

    private ArrayList<String> results = new ArrayList<String>();

    public void prepare() {
        nrOfDictionaries = 4;
        results = new ArrayList<String>();
    }

    public synchronized void oneDictionaryFinished() {
        nrOfDictionaries--;
        System.out.println("one dictionary finished");
        notifyAll();
    }

    public synchronized boolean isReady() throws InterruptedException {

        while (nrOfDictionaries != 0) {
            wait();
        }

        return true;
    }

    public synchronized void addResult(String result) {
        results.add(result);
    }

    public ArrayList<String> getAllResults() {
        return results;
    }
}
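
As a side note, the counting and waiting done by hand in this class is roughly what java.util.concurrent.CountDownLatch provides out of the box. The following is a minimal sketch of the same monitor built on a latch. The class name LatchResultsSketch and the helper methods awaitAllResults and demo are hypothetical and not part of the answer's code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;

// Hypothetical variant of the Results monitor, using a latch instead of wait()/notifyAll().
final class LatchResultsSketch {
    private final CountDownLatch remaining;
    private final List<String> results =
            Collections.synchronizedList(new ArrayList<String>());

    LatchResultsSketch(int nrOfDictionaries) {
        remaining = new CountDownLatch(nrOfDictionaries);
    }

    void addResult(String result) {
        results.add(result); // the synchronized list makes concurrent adds safe
    }

    void oneDictionaryFinished() {
        remaining.countDown();
    }

    /** Blocks until every dictionary has reported completion, then returns a copy. */
    List<String> awaitAllResults() throws InterruptedException {
        remaining.await();
        synchronized (results) {
            return new ArrayList<String>(results);
        }
    }

    /** Small demonstration: two worker threads, each adding one result. */
    static List<String> demo() {
        final LatchResultsSketch results = new LatchResultsSketch(2);
        for (final String name : new String[] {"dictionary A", "dictionary B"}) {
            new Thread(() -> {
                results.addResult("result of " + name);
                results.oneDictionaryFinished();
            }).start();
        }
        try {
            return results.awaitAllResults();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
    }
}
```

The number of dictionaries is passed to the constructor here rather than hard-coded, so the same class would work for the five dictionaries mentioned in the question.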

The thread itself, which can be set up to search a specific dictionary:

public class ThreadDictionarySearch extends Thread {

    // the actual dictionary
    private String dictionary;
    private Results results;

    public ThreadDictionarySearch(Results results, String dictionary) {
        this.dictionary = dictionary;
        this.results = results;
    }

    @Override
    public void run() {

        for (int i = 0; i < 4; i++) {
            // search dictionary;
            results.addResult("result of " + dictionary);
            System.out.println("adding result from " + dictionary);
        }

        results.oneDictionaryFinished();
    }

}

And the main method for demonstration:

public static void main(String[] args) throws Exception {

    Results results = new Results();

    ThreadDictionarySearch threadA = new ThreadDictionarySearch(results, "dictionary A");
    ThreadDictionarySearch threadB = new ThreadDictionarySearch(results, "dictionary B");
    ThreadDictionarySearch threadC = new ThreadDictionarySearch(results, "dictionary C");
    ThreadDictionarySearch threadD = new ThreadDictionarySearch(results, "dictionary D");

    threadA.start();
    threadB.start();
    threadC.start();
    threadD.start();

    // isReady() blocks here until all dictionaries are searched,
    // because "Results" calls wait() while any search is unfinished
    if (results.isReady()) {
        for (String string : results.getAllResults()) {
            System.out.println("RESULT: " + string);
        }
    }
}
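
If subclassing Thread feels like too much restructuring, an ExecutorService may be a lighter alternative: each dictionary scan becomes a Callable, and invokeAll waits for all of them. Below is a minimal sketch under several assumptions. ParallelLookupSketch, Match, scan and closest are hypothetical names, Match only loosely mirrors the question's DictionaryEntry, and the built-in distance method stands in for StringUtils.getLevenshteinDistance so the example is self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelLookupSketch {

    /** Hypothetical result holder: matched value, dictionary index, edit distance. */
    static final class Match {
        final String value;
        final int dictionaryIndex;
        final int distance;
        Match(String value, int dictionaryIndex, int distance) {
            this.value = value;
            this.dictionaryIndex = dictionaryIndex;
            this.distance = distance;
        }
    }

    /** Scans one read-only dictionary; returns the closest key below maxDistance, or null. */
    static Match scan(Map<String, String> map, int index, String key, int maxDistance) {
        Match best = null;
        for (Map.Entry<String, String> entry : map.entrySet()) {
            int d = distance(entry.getKey(), key);
            if (d < maxDistance && (best == null || d < best.distance)) {
                best = new Match(entry.getValue(), index, d);
            }
        }
        return best;
    }

    /** Runs one scan per dictionary in parallel and returns the overall closest match. */
    static Match closest(List<Map<String, String>> dictionaries, String key, int maxDistance) {
        if (dictionaries.isEmpty()) {
            return null;
        }
        ExecutorService pool = Executors.newFixedThreadPool(dictionaries.size());
        try {
            List<Callable<Match>> tasks = new ArrayList<>();
            for (int i = 0; i < dictionaries.size(); i++) {
                final int index = i;
                tasks.add(() -> scan(dictionaries.get(index), index, key, maxDistance));
            }
            Match best = null;
            for (Future<Match> future : pool.invokeAll(tasks)) { // blocks until all scans finish
                Match m = future.get();
                if (m != null && (best == null || m.distance < best.distance)) {
                    best = m; // on ties, the dictionary with the lower index wins
                }
            }
            return best;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("lookup interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("lookup failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    /** Plain two-row Levenshtein distance, used here in place of StringUtils. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }
}
```

Creating a pool for every lookup keeps the sketch short, but with around 7000 lookups it would probably be better to create the pool once and reuse it. Since the maps are only read after they are built, sharing them between the worker threads should not need a ConcurrentHashMap.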
Answered Oct 19 '22 at 07:10 by Johny