String split and compare - fastest method

Question

I have a string like:

1,2,3:3,4,5

The string on the left side of the delimiter needs to be compared to the string on the right side of the delimiter(:). Now when I mean compare, I actually mean to find if the elements in the right part (3,4,5) are present in the elements of the left part (1,2,3). The right part can contain duplicates and that's fine (evidently meaning I cannot use a HashSet). I've accomplished this (details below) but I need the fastest way to split and compare the above mentioned strings.

This is purely a performance based question to find out which method can be faster since the actual input that I will be using is huge (on either side). There would be only a single line and it will be read through stdin.

How I've accomplished this:

Read stdin.
Split using string.split and store the left part in a HashSet.
Store the right part in an ArrayList.
Iterate through the array list use contains() to check if the element is present in the HashSet.

leventov · Accepted Answer

Read input into byte[] array to hold the pointer on the side of your code.

Read byte by byte, computing integer elements on the way:

int b = inputBytes[p++];
int d = b - '0';
if (0 <= d) {
    if (d <= 9) {
        element = element * 10 + d;
    } else {
        // b == ':'
    }
} else {
    // b == ','
    // add element to the hash; element = 0;
    ...
}
if (p == inputBytesLength) {
    inputBytesLength = in.read(inputBytes);
    if (inputBytesLength == 0) { ... }
    p = 0;
}

Use int[] with length of sufficiently big power of two as hash:

// as add()
int h = element * 0x9E3779B9;
int i = h >>> (32 - hashSizePower);
while (hash[i] != 0) {
    if (--i < 0) i += hashSize;
}
hash[i] = element;

// contains() similarly

darin · Answer

Assuming a line of input fits in JVM heap, three common approaches to parsing strings from input in Java are:

java.util.Scanner
java.io.BufferedReader#readLine & java.util.StringTokenizer
java.io.BufferedReader#readLine & java.lang.String#split

It wasn’t obvious to me which approach was best for this problem, so I decided to try it out. I generated test data, implemented a parser for each approach, and timed the results.

Test Data

I generated 4 files of test data:

testdata_1k.txt - size 20KB
testdata_10k.txt - size 205KB
testdata_100k.txt - size 2MB
testdata_1000k.txt - size 20M

The files I generated matched the format you described. Each , delimited element is a random integer. The number in the file name describes the number of elements on each side if the :. For example, testdata_1k.txt has 1,000 elements on the left and 1,000 elements on the right.

Test Code

Here's the code I used to test each approach. Please note, these are not examples of production quality code.

Scanner Code

public Map<String, Boolean> scanner(InputStream stream) {
    final Scanner in = new Scanner(new BufferedInputStream(stream));
    final HashMap<String, Boolean> result = new HashMap<String, Boolean>();
    final HashSet<String> left = new HashSet<String>();

    in.useDelimiter(",");
    boolean leftSide = true;
    while (in.hasNext()) {
        String token = in.next();
        if (leftSide) {
            int delim = token.indexOf(':');
            if (delim >= 0) {
                left.add(token.substring(0, delim));
                String rightToken = token.substring(delim + 1, token.length());
                result.put(rightToken, left.contains(rightToken));
                leftSide = false;
            } else {
                left.add(token);
            }
        } else {
            result.put(token, left.contains(token));
        }
    }
    return result;
}

StringTokenizer Code

public Map<String, Boolean> stringTokenizer(InputStream stream) throws IOException {
    final BufferedReader in = new BufferedReader(new InputStreamReader(stream));
    final HashMap<String, Boolean> result = new HashMap<String, Boolean>();

    final StringTokenizer lineTokens = new StringTokenizer(in.readLine(), ":");
    final HashSet<String> left = new HashSet<String>();
    if (lineTokens.hasMoreTokens()) {
        final StringTokenizer leftTokens = new StringTokenizer(lineTokens.nextToken(), ",");
        while (leftTokens.hasMoreTokens()) {
            left.add(leftTokens.nextToken());
        }
    }
    if (lineTokens.hasMoreTokens()) {
        final StringTokenizer rightTokens = new StringTokenizer(lineTokens.nextToken(), ",");
        while (rightTokens.hasMoreTokens()) {
            String token = rightTokens.nextToken();
            result.put(token, left.contains(token));
        }
    }
    return result;
}

String.split Code

public Map<String, Boolean> split(InputStream stream) throws IOException {
    final BufferedReader in = new BufferedReader(new InputStreamReader(stream));
    final HashMap<String, Boolean> result = new HashMap<String, Boolean>();

    final String[] splitLine = in.readLine().split(":");
    final HashSet<String> left = new HashSet<String>(Arrays.asList(splitLine[0].split(",")));

    for (String element : splitLine[1].split(",")) {
        result.put(element, left.contains(element));
    }
    return result;
}

Timing

I ran each approach 6 times against each file. I threw the first sample out. The following represents the average of the remaining 5 samples.

Scanner

testdata_1k.txt - 23.2948 millis
testdata_10k.txt - 39.5036 millis
testdata_100k.txt - 240.5626 millis
testdata_1000k.txt - 2671.5132 millis

StringTokenizer

testdata_1k.txt - 31.2344 millis
testdata_10k.txt -14.7926 millis
testdata_100k.txt - 102.6412 millis
testdata_1000k.txt - 1353.073 millis

String.split

testdata_1k.txt - 8.9596 millis
testdata_10k.txt - 7.8396 millis
testdata_100k.txt - 63.4854 millis
testdata_1000k.txt - 947.8384 millis

Conclusion

Assuming your data fits in JVM heap, it’s hard to beat the parsing speed of String.split compared to StringTokenizer and Scanner.

String split and compare - fastest method

Tags:

java

performance

string

split

Subashini P

2 Answers

leventov

Test Data

Test Code

Scanner Code

StringTokenizer Code

String.split Code

Timing

Scanner

StringTokenizer

String.split

Conclusion

darin

Recent Activity

Donate For Us

String split and compare - fastest method

Tags:

java

performance

string

split

Subashini P

2 Answers

leventov

Test Data

Test Code

Scanner Code

StringTokenizer Code

String.split Code

Timing

Scanner

StringTokenizer

String.split

Conclusion

darin

Related questions

Recent Activity

Donate For Us