Encouraged by this, and the fact I have billions of string to parse, I tried to modify my code to accept StringTokenizer instead of String[]
The only thing left between me and getting that delicious x2 performance boost is the fact that when you're doing
"dog,,cat".split(",")
//output: ["dog","","cat"]
StringTokenizer("dog,,cat")
// nextToken() = "dog"
// nextToken() = "cat"
How can I achieve similar results with the StringTokenizer? Are there faster ways to do this?
Are you only actually tokenizing on commas? If so, I'd write my own tokenizer - it may well end up being even more efficient than the more general purpose StringTokenizer which can look for multiple tokens, and you can make it behave however you'd like. For such a simple use case, it can be a simple implementation.
If it would be useful, you could even implement Iterable<String>
and get enhanced-for-loop support with strong typing instead of the Enumeration
support provided by StringTokenizer
. Let me know if you want any help coding such a beast up - it really shouldn't be too hard.
Additionally, I'd try running performance tests on your actual data before leaping too far from an existing solution. Do you have any idea how much of your execution time is actually spent in String.split
? I know you have a lot of strings to parse, but if you're doing anything significant with them afterwards, I'd expect that to be much more significant than the splitting.
After tinkering with the StringTokenizer
class, I could not find a way to satisfy the requirements to return ["dog", "", "cat"]
.
Furthermore, the StringTokenizer
class is left only for compatibility reasons, and the use of String.split
is encouaged. From the API Specification for the StringTokenizer
:
StringTokenizer
is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use thesplit
method ofString
or thejava.util.regex
package instead.
Since the issue is the supposedly poor performance of the String.split
method, we need to find an alternative.
Note: I am saying "supposedly poor performance" because it's hard to determine that every use case is going to result in the StringTokenizer
being superior to the String.split
method. Furthermore, in many cases, unless the tokenization of the strings are indeed the bottleneck of the application determined by proper profiling, I feel that it will end up being a premature optimization, if anything. I would be inclined to say write code that is meaningful and easy to understand before venturing on optimization.
Now, from the current requirements, probably rolling our own tokenizer wouldn't be too difficult.
Roll our own tokenzier!
The following is a simple tokenizer I wrote. I should note that there are no speed optimizations, nor is there error-checks to prevent going past the end of the string -- this is a quick-and-dirty implementation:
class MyTokenizer implements Iterable<String>, Iterator<String> {
String delim = ",";
String s;
int curIndex = 0;
int nextIndex = 0;
boolean nextIsLastToken = false;
public MyTokenizer(String s, String delim) {
this.s = s;
this.delim = delim;
}
public Iterator<String> iterator() {
return this;
}
public boolean hasNext() {
nextIndex = s.indexOf(delim, curIndex);
if (nextIsLastToken)
return false;
if (nextIndex == -1)
nextIsLastToken = true;
return true;
}
public String next() {
if (nextIndex == -1)
nextIndex = s.length();
String token = s.substring(curIndex, nextIndex);
curIndex = nextIndex + 1;
return token;
}
public void remove() {
throw new UnsupportedOperationException();
}
}
The MyTokenizer
will take a String
to tokenize and a String
as a delimiter, and use the String.indexOf
method to perform the search for delimiters. Tokens are produced by the String.substring
method.
I would suspect there could be some performance improvements by working on the string at the char[]
level rather than at the String
level. But I'll leave that up as an exercise to the reader.
The class also implements Iterable
and Iterator
in order to take advantage of the for-each
loop construct that was introduced in Java 5. StringTokenizer
is an Enumerator
, and does not support the for-each
construct.
Is it any faster?
In order to find out if this is any faster, I wrote a program to compare speeds in the following four methods:
StringTokenizer
.MyTokenizer
.String.split
.Pattern.compile
.In the four methods, the string "dog,,cat"
was separated into tokens. Although the StringTokenizer
is included in the comparison, it should be noted that it will not return the desired result of ["dog", "", "cat]
.
The tokenizing was repeated for a total of 1 million times to give take enough time to notice the difference in the methods.
The code used for the simple benchmark was the following:
long st = System.currentTimeMillis();
for (int i = 0; i < 1e6; i++) {
StringTokenizer t = new StringTokenizer("dog,,cat", ",");
while (t.hasMoreTokens()) {
t.nextToken();
}
}
System.out.println(System.currentTimeMillis() - st);
st = System.currentTimeMillis();
for (int i = 0; i < 1e6; i++) {
MyTokenizer mt = new MyTokenizer("dog,,cat", ",");
for (String t : mt) {
}
}
System.out.println(System.currentTimeMillis() - st);
st = System.currentTimeMillis();
for (int i = 0; i < 1e6; i++) {
String[] tokens = "dog,,cat".split(",");
for (String t : tokens) {
}
}
System.out.println(System.currentTimeMillis() - st);
st = System.currentTimeMillis();
Pattern p = Pattern.compile(",");
for (int i = 0; i < 1e6; i++) {
String[] tokens = p.split("dog,,cat");
for (String t : tokens) {
}
}
System.out.println(System.currentTimeMillis() - st);
The Results
The tests were run using Java SE 6 (build 1.6.0_12-b04), and results were the following:
Run 1 Run 2 Run 3 Run 4 Run 5 ----- ----- ----- ----- ----- StringTokenizer 172 188 187 172 172 MyTokenizer 234 234 235 234 235 String.split 1172 1156 1171 1172 1156 Pattern.compile 906 891 891 907 906
So, as can be seen from the limited testing and only five runs, the StringTokenizer
did in fact come out the fastest, but the MyTokenizer
came in as a close 2nd. Then, String.split
was the slowest, and the precompiled regular expression was slightly faster than the split
method.
As with any little benchmark, it probably isn't very representative of real-life conditions, so the results should be taken with a grain (or a mound) of salt.
Note: Having done some quick benchmarks, Scanner turns out to be about four times slower than String.split. Hence, do not use Scanner.
(I'm leaving the post up to record the fact that Scanner is a bad idea in this case. (Read as: do not downvote me for suggesting Scanner, please...))
Assuming you are using Java 1.5 or higher, try Scanner, which implements Iterator<String>
, as it happens:
Scanner sc = new Scanner("dog,,cat");
sc.useDelimiter(",");
while (sc.hasNext()) {
System.out.println(sc.next());
}
gives:
dog
cat
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With