I have a string like:
1,2,3:3,4,5
The string on the left side of the delimiter needs to be compared to the string on the right side of the delimiter(:
). Now when I mean compare, I actually mean to find if the elements in the right part (3,4,5)
are present in the elements of the left part (1,2,3)
. The right part can contain duplicates and that's fine (evidently meaning I cannot use a HashSet
). I've accomplished this (details below) but I need the fastest way to split and compare the above mentioned strings.
This is purely a performance based question to find out which method can be faster since the actual input that I will be using is huge (on either side). There would be only a single line and it will be read through stdin.
How I've accomplished this:
HashSet
.ArrayList
.contains()
to check if the element is present in the HashSet
.Read input into byte[]
array to hold the pointer on the side of your code.
Read byte by byte, computing integer elements on the way:
int b = inputBytes[p++];
int d = b - '0';
if (0 <= d) {
if (d <= 9) {
element = element * 10 + d;
} else {
// b == ':'
}
} else {
// b == ','
// add element to the hash; element = 0;
...
}
if (p == inputBytesLength) {
inputBytesLength = in.read(inputBytes);
if (inputBytesLength == 0) { ... }
p = 0;
}
Use int[]
with length of sufficiently big power of two as hash:
// as add()
int h = element * 0x9E3779B9;
int i = h >>> (32 - hashSizePower);
while (hash[i] != 0) {
if (--i < 0) i += hashSize;
}
hash[i] = element;
// contains() similarly
Assuming a line of input fits in JVM heap, three common approaches to parsing strings from input in Java are:
java.util.Scanner
java.io.BufferedReader#readLine
& java.util.StringTokenizer
java.io.BufferedReader#readLine
& java.lang.String#split
It wasn’t obvious to me which approach was best for this problem, so I decided to try it out. I generated test data, implemented a parser for each approach, and timed the results.
I generated 4 files of test data:
The files I generated matched the format you described. Each ,
delimited element is a random integer. The number in the file name describes the number of elements on each side if the :
. For example, testdata_1k.txt has 1,000 elements on the left and 1,000 elements on the right.
Here's the code I used to test each approach. Please note, these are not examples of production quality code.
public Map<String, Boolean> scanner(InputStream stream) {
final Scanner in = new Scanner(new BufferedInputStream(stream));
final HashMap<String, Boolean> result = new HashMap<String, Boolean>();
final HashSet<String> left = new HashSet<String>();
in.useDelimiter(",");
boolean leftSide = true;
while (in.hasNext()) {
String token = in.next();
if (leftSide) {
int delim = token.indexOf(':');
if (delim >= 0) {
left.add(token.substring(0, delim));
String rightToken = token.substring(delim + 1, token.length());
result.put(rightToken, left.contains(rightToken));
leftSide = false;
} else {
left.add(token);
}
} else {
result.put(token, left.contains(token));
}
}
return result;
}
public Map<String, Boolean> stringTokenizer(InputStream stream) throws IOException {
final BufferedReader in = new BufferedReader(new InputStreamReader(stream));
final HashMap<String, Boolean> result = new HashMap<String, Boolean>();
final StringTokenizer lineTokens = new StringTokenizer(in.readLine(), ":");
final HashSet<String> left = new HashSet<String>();
if (lineTokens.hasMoreTokens()) {
final StringTokenizer leftTokens = new StringTokenizer(lineTokens.nextToken(), ",");
while (leftTokens.hasMoreTokens()) {
left.add(leftTokens.nextToken());
}
}
if (lineTokens.hasMoreTokens()) {
final StringTokenizer rightTokens = new StringTokenizer(lineTokens.nextToken(), ",");
while (rightTokens.hasMoreTokens()) {
String token = rightTokens.nextToken();
result.put(token, left.contains(token));
}
}
return result;
}
public Map<String, Boolean> split(InputStream stream) throws IOException {
final BufferedReader in = new BufferedReader(new InputStreamReader(stream));
final HashMap<String, Boolean> result = new HashMap<String, Boolean>();
final String[] splitLine = in.readLine().split(":");
final HashSet<String> left = new HashSet<String>(Arrays.asList(splitLine[0].split(",")));
for (String element : splitLine[1].split(",")) {
result.put(element, left.contains(element));
}
return result;
}
I ran each approach 6 times against each file. I threw the first sample out. The following represents the average of the remaining 5 samples.
Assuming your data fits in JVM heap, it’s hard to beat the parsing speed of String.split
compared to StringTokenizer
and Scanner
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With