I have a string like:
1,2,3:3,4,5
The string on the left side of the delimiter needs to be compared to the string on the right side of the delimiter(:). Now when I mean compare, I actually mean to find if the elements in the right part (3,4,5) are present in the elements of the left part (1,2,3). The right part can contain duplicates and that's fine (evidently meaning I cannot use a HashSet). I've accomplished this (details below) but I need the fastest way to split and compare the above mentioned strings.
This is purely a performance based question to find out which method can be faster since the actual input that I will be using is huge (on either side). There would be only a single line and it will be read through stdin.
How I've accomplished this:
HashSet.ArrayList.contains() to check if the element is present in the HashSet.Read input into byte[] array to hold the pointer on the side of your code.
Read byte by byte, computing integer elements on the way:
int b = inputBytes[p++];
int d = b - '0';
if (0 <= d) {
if (d <= 9) {
element = element * 10 + d;
} else {
// b == ':'
}
} else {
// b == ','
// add element to the hash; element = 0;
...
}
if (p == inputBytesLength) {
inputBytesLength = in.read(inputBytes);
if (inputBytesLength == 0) { ... }
p = 0;
}
Use int[] with length of sufficiently big power of two as hash:
// as add()
int h = element * 0x9E3779B9;
int i = h >>> (32 - hashSizePower);
while (hash[i] != 0) {
if (--i < 0) i += hashSize;
}
hash[i] = element;
// contains() similarly
Assuming a line of input fits in JVM heap, three common approaches to parsing strings from input in Java are:
java.util.Scanner java.io.BufferedReader#readLine & java.util.StringTokenizer
java.io.BufferedReader#readLine & java.lang.String#split
It wasn’t obvious to me which approach was best for this problem, so I decided to try it out. I generated test data, implemented a parser for each approach, and timed the results.
I generated 4 files of test data:
The files I generated matched the format you described. Each , delimited element is a random integer. The number in the file name describes the number of elements on each side if the :. For example, testdata_1k.txt has 1,000 elements on the left and 1,000 elements on the right.
Here's the code I used to test each approach. Please note, these are not examples of production quality code.
public Map<String, Boolean> scanner(InputStream stream) {
final Scanner in = new Scanner(new BufferedInputStream(stream));
final HashMap<String, Boolean> result = new HashMap<String, Boolean>();
final HashSet<String> left = new HashSet<String>();
in.useDelimiter(",");
boolean leftSide = true;
while (in.hasNext()) {
String token = in.next();
if (leftSide) {
int delim = token.indexOf(':');
if (delim >= 0) {
left.add(token.substring(0, delim));
String rightToken = token.substring(delim + 1, token.length());
result.put(rightToken, left.contains(rightToken));
leftSide = false;
} else {
left.add(token);
}
} else {
result.put(token, left.contains(token));
}
}
return result;
}
public Map<String, Boolean> stringTokenizer(InputStream stream) throws IOException {
final BufferedReader in = new BufferedReader(new InputStreamReader(stream));
final HashMap<String, Boolean> result = new HashMap<String, Boolean>();
final StringTokenizer lineTokens = new StringTokenizer(in.readLine(), ":");
final HashSet<String> left = new HashSet<String>();
if (lineTokens.hasMoreTokens()) {
final StringTokenizer leftTokens = new StringTokenizer(lineTokens.nextToken(), ",");
while (leftTokens.hasMoreTokens()) {
left.add(leftTokens.nextToken());
}
}
if (lineTokens.hasMoreTokens()) {
final StringTokenizer rightTokens = new StringTokenizer(lineTokens.nextToken(), ",");
while (rightTokens.hasMoreTokens()) {
String token = rightTokens.nextToken();
result.put(token, left.contains(token));
}
}
return result;
}
public Map<String, Boolean> split(InputStream stream) throws IOException {
final BufferedReader in = new BufferedReader(new InputStreamReader(stream));
final HashMap<String, Boolean> result = new HashMap<String, Boolean>();
final String[] splitLine = in.readLine().split(":");
final HashSet<String> left = new HashSet<String>(Arrays.asList(splitLine[0].split(",")));
for (String element : splitLine[1].split(",")) {
result.put(element, left.contains(element));
}
return result;
}
I ran each approach 6 times against each file. I threw the first sample out. The following represents the average of the remaining 5 samples.
Assuming your data fits in JVM heap, it’s hard to beat the parsing speed of String.split compared to StringTokenizer and Scanner.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With