Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

String split and compare - fastest method

I have a string like:

1,2,3:3,4,5

The string on the left side of the delimiter needs to be compared to the string on the right side of the delimiter(:). Now when I mean compare, I actually mean to find if the elements in the right part (3,4,5) are present in the elements of the left part (1,2,3). The right part can contain duplicates and that's fine (evidently meaning I cannot use a HashSet). I've accomplished this (details below) but I need the fastest way to split and compare the above mentioned strings.

This is purely a performance based question to find out which method can be faster since the actual input that I will be using is huge (on either side). There would be only a single line and it will be read through stdin.

How I've accomplished this:

  1. Read stdin.
  2. Split using string.split and store the left part in a HashSet.
  3. Store the right part in an ArrayList.
  4. Iterate through the array list use contains() to check if the element is present in the HashSet.
like image 966
Subashini P Avatar asked Oct 03 '22 10:10

Subashini P


2 Answers

  1. Read input into byte[] array to hold the pointer on the side of your code.

  2. Read byte by byte, computing integer elements on the way:

    int b = inputBytes[p++];
    int d = b - '0';
    if (0 <= d) {
        if (d <= 9) {
            element = element * 10 + d;
        } else {
            // b == ':'
        }
    } else {
        // b == ','
        // add element to the hash; element = 0;
        ...
    }
    if (p == inputBytesLength) {
        inputBytesLength = in.read(inputBytes);
        if (inputBytesLength == 0) { ... }
        p = 0;
    }
    
  3. Use int[] with length of sufficiently big power of two as hash:

    // as add()
    int h = element * 0x9E3779B9;
    int i = h >>> (32 - hashSizePower);
    while (hash[i] != 0) {
        if (--i < 0) i += hashSize;
    }
    hash[i] = element;
    
    // contains() similarly
    
like image 71
leventov Avatar answered Oct 07 '22 02:10

leventov


Assuming a line of input fits in JVM heap, three common approaches to parsing strings from input in Java are:

  1. java.util.Scanner
  2. java.io.BufferedReader#readLine & java.util.StringTokenizer
  3. java.io.BufferedReader#readLine & java.lang.String#split

It wasn’t obvious to me which approach was best for this problem, so I decided to try it out. I generated test data, implemented a parser for each approach, and timed the results.

Test Data

I generated 4 files of test data:

  • testdata_1k.txt - size 20KB
  • testdata_10k.txt - size 205KB
  • testdata_100k.txt - size 2MB
  • testdata_1000k.txt - size 20M

The files I generated matched the format you described. Each , delimited element is a random integer. The number in the file name describes the number of elements on each side if the :. For example, testdata_1k.txt has 1,000 elements on the left and 1,000 elements on the right.

Test Code

Here's the code I used to test each approach. Please note, these are not examples of production quality code.

Scanner Code

public Map<String, Boolean> scanner(InputStream stream) {
    final Scanner in = new Scanner(new BufferedInputStream(stream));
    final HashMap<String, Boolean> result = new HashMap<String, Boolean>();
    final HashSet<String> left = new HashSet<String>();

    in.useDelimiter(",");
    boolean leftSide = true;
    while (in.hasNext()) {
        String token = in.next();
        if (leftSide) {
            int delim = token.indexOf(':');
            if (delim >= 0) {
                left.add(token.substring(0, delim));
                String rightToken = token.substring(delim + 1, token.length());
                result.put(rightToken, left.contains(rightToken));
                leftSide = false;
            } else {
                left.add(token);
            }
        } else {
            result.put(token, left.contains(token));
        }
    }
    return result;
}

StringTokenizer Code

public Map<String, Boolean> stringTokenizer(InputStream stream) throws IOException {
    final BufferedReader in = new BufferedReader(new InputStreamReader(stream));
    final HashMap<String, Boolean> result = new HashMap<String, Boolean>();

    final StringTokenizer lineTokens = new StringTokenizer(in.readLine(), ":");
    final HashSet<String> left = new HashSet<String>();
    if (lineTokens.hasMoreTokens()) {
        final StringTokenizer leftTokens = new StringTokenizer(lineTokens.nextToken(), ",");
        while (leftTokens.hasMoreTokens()) {
            left.add(leftTokens.nextToken());
        }
    }
    if (lineTokens.hasMoreTokens()) {
        final StringTokenizer rightTokens = new StringTokenizer(lineTokens.nextToken(), ",");
        while (rightTokens.hasMoreTokens()) {
            String token = rightTokens.nextToken();
            result.put(token, left.contains(token));
        }
    }
    return result;
}

String.split Code

public Map<String, Boolean> split(InputStream stream) throws IOException {
    final BufferedReader in = new BufferedReader(new InputStreamReader(stream));
    final HashMap<String, Boolean> result = new HashMap<String, Boolean>();

    final String[] splitLine = in.readLine().split(":");
    final HashSet<String> left = new HashSet<String>(Arrays.asList(splitLine[0].split(",")));

    for (String element : splitLine[1].split(",")) {
        result.put(element, left.contains(element));
    }
    return result;
}

Timing

I ran each approach 6 times against each file. I threw the first sample out. The following represents the average of the remaining 5 samples.

Scanner

  • testdata_1k.txt - 23.2948 millis
  • testdata_10k.txt - 39.5036 millis
  • testdata_100k.txt - 240.5626 millis
  • testdata_1000k.txt - 2671.5132 millis

StringTokenizer

  • testdata_1k.txt - 31.2344 millis
  • testdata_10k.txt -14.7926 millis
  • testdata_100k.txt - 102.6412 millis
  • testdata_1000k.txt - 1353.073 millis

String.split

  • testdata_1k.txt - 8.9596 millis
  • testdata_10k.txt - 7.8396 millis
  • testdata_100k.txt - 63.4854 millis
  • testdata_1000k.txt - 947.8384 millis

Conclusion

Assuming your data fits in JVM heap, it’s hard to beat the parsing speed of String.split compared to StringTokenizer and Scanner.

like image 28
darin Avatar answered Oct 07 '22 02:10

darin