There are many questions, answers, and opinions about how to do low-level Java optimization with for, while, and do-while loops, and whether it's even necessary.
My question is more about high-level optimization in design. Let's assume I have to do the following:
For a given string input, count the occurrence of each letter in the string.
This is not a major problem when the string is a few sentences, but what if instead we want to count the occurrence of each word in a 900,000-word file? Building loops just seems to waste time.
So what is the high-level design pattern that can be applied to this type of problem?
I guess my major point is that I tend to use loops to solve many problems, and I would like to get out of the habit of using loops.
Thanks in advance,
Sam
P.S. If possible, can you produce some pseudocode for solving the 900,000-word file problem? I tend to understand code better than I understand English, which I assume is the same for most visitors of this site.
Optimization in Java can be an elusive target since the execution environments vary. Using a better algorithm probably will yield a bigger performance increase than any amount of low-level optimizations and is more likely to deliver an improvement under all execution conditions.
The word count problem is one of the most widely covered problems in the Big Data world; it's kind of the Hello World of frameworks like Hadoop. You can find ample information throughout the web on this problem.
I'll give you some thoughts on it anyway.
First, 900,000 words might still be small enough to build a hash map for, so don't discount the obvious in-memory approach. You said pseudocode is fine, so:
h = new HashMap<String, Integer>()
for each word w picked up while tokenizing the file {
    h[w] = (w in h) ? h[w] + 1 : 1
}
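In real Java that idea looks roughly like the following. This is just a minimal sketch (the class name WordCounter is a placeholder); it reads words from standard input and prints each distinct word with its count:

import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;

public class WordCounter {
    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        Scanner input = new Scanner(System.in);
        // Tally each word as it is read.
        while (input.hasNext()) {
            String word = input.next().toLowerCase();
            Integer current = counts.get(word);
            counts.put(word, current == null ? 1 : current + 1);
        }
        // Print each distinct word with its count.
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            System.out.println(entry.getKey() + " " + entry.getValue());
        }
    }
}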
Now once your dataset is too large to build an in-memory hash map for, you can do your counting like so:
1. Tokenize the input into words, writing each word on its own line in a file.
2. Use the Unix sort command to produce a sorted version of that file.
3. Count as you traverse the sorted file.
These three steps go in a Unix pipeline. Let the OS do the work for you here.
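For example, the whole pipeline might look something like this (words.txt is a placeholder for your input file, and WordGenerator is the tokenizer shown in the addendum below):

java WordGenerator < words.txt | sort | uniq -c | sort -rn

Here uniq -c prefixes each distinct word of the sorted stream with its count, and the final sort -rn lists the most frequent words first.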
Now, as you get even more data, you want to bring in MapReduce frameworks like Hadoop to do the word counting on clusters of machines.
Now, I've heard that when you get into obscenely large datasets, doing things in a distributed environment does not help anymore, because the transmission time overwhelms the counting time, and in your case of word counting everything has to "be put back together anyway". At that point you have to use some very sophisticated techniques that I suspect you can find in research papers.
ADDENDUM
The OP asked for an example of tokenizing the input in Java. Here is the easiest way:
import java.util.Scanner;

public class WordGenerator {
    /**
     * Tokenizes standard input into words, writing each word to standard output,
     * one per line. Because it reads from standard input and writes to standard
     * output, it can easily be used in a pipeline combined with sort, uniq, and
     * any other such application.
     */
    public static void main(String[] args) {
        Scanner input = new Scanner(System.in);
        while (input.hasNext()) {
            System.out.println(input.next().toLowerCase());
        }
    }
}
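Assuming you save this as WordGenerator.java, compile it first with:

javac WordGenerator.java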
Now here is an example of using it:
echo -e "Hey Moe! Woo\nwoo woo nyuk-nyuk why soitenly. Hey." | java WordGenerator
This outputs
hey
moe!
woo
woo
woo
nyuk-nyuk
why
soitenly.
hey.
You can combine this tokenizer with sort and uniq like so:
echo -e "Hey Moe! Woo\nwoo woo nyuk-nyuk why soitenly. Hey." | java WordGenerator | sort | uniq
Yielding
hey
hey.
moe!
nyuk-nyuk
soitenly.
why
woo
Now if you only want to keep letters and throw away all punctuation, digits, and other characters, change your scanner definition line to the following (and add import java.util.regex.Pattern; at the top of the file):
Scanner input = new Scanner(System.in).useDelimiter(Pattern.compile("\\P{L}"));
And now
echo -e "Hey Moe! Woo\nwoo woo^nyuk-nyuk why#2soitenly. Hey." | java WordGenerator | sort | uniq
Yields
hey
moe
nyuk
soitenly
why
woo
There is a blank line in the output; I'll let you figure out how to whack it. :)
The fastest solution to this is O(n), AFAIK: use a loop to iterate over the string, get each character, and update its count in a HashMap accordingly. At the end, the HashMap contains all the characters that occurred and a count of the occurrences of each.
Some pseudo-code (may not compile):
// Count the occurrences of each character in str.
HashMap<Character, Integer> map = new HashMap<Character, Integer>();
for (int i = 0; i < str.length(); i++) {
    char c = str.charAt(i);
    if (map.containsKey(c)) {
        map.put(c, map.get(c) + 1);
    } else {
        map.put(c, 1);
    }
}
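If it helps to see it run, here is a rough, self-contained version of the same idea (the class name CharCounter and the sample string are just placeholders):

import java.util.HashMap;
import java.util.Map;

public class CharCounter {
    public static void main(String[] args) {
        String str = "count the occurrence of each letter in the string";
        Map<Character, Integer> map = new HashMap<Character, Integer>();
        // One pass over the string, so the whole thing is O(n).
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            if (map.containsKey(c)) {
                map.put(c, map.get(c) + 1);
            } else {
                map.put(c, 1);
            }
        }
        // Print each character with its count.
        for (Map.Entry<Character, Integer> entry : map.entrySet()) {
            System.out.println("'" + entry.getKey() + "': " + entry.getValue());
        }
    }
}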