
Performance of StringTokenizer class vs. String.split method in Java

In my software I need to split strings into words. I currently have more than 19,000,000 documents with more than 30 words each.

Which of the following two ways is the best way to do this (in terms of performance)?

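StringTokenizer sTokenize = new StringTokenizer(s, " ");
while (sTokenize.hasMoreTokens()) {
    String word = sTokenize.nextToken();
    // process word
}

or

String[] splitS = s.split(" ");
for (int i = 0; i < splitS.length; i++) {
    String word = splitS[i];
    // process word
}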
JohnJohnGa asked May 11 '11 14:05




2 Answers

If your data is already in a database and you need to parse the string of words, I would suggest using indexOf repeatedly. It's roughly two to three times faster than either of your solutions in the benchmark below.

However, getting the data from a database is still likely to be much more expensive.

StringBuilder sb = new StringBuilder();
for (int i = 100000; i < 100000 + 60; i++)
    sb.append(i).append(' ');
String sample = sb.toString();

int runs = 100000;
for (int i = 0; i < 5; i++) {
    {
        long start = System.nanoTime();
        for (int r = 0; r < runs; r++) {
            StringTokenizer st = new StringTokenizer(sample);
            List<String> list = new ArrayList<String>();
            while (st.hasMoreTokens())
                list.add(st.nextToken());
        }
        long time = System.nanoTime() - start;
        System.out.printf("StringTokenizer took an average of %.1f us%n", time / runs / 1000.0);
    }
    {
        long start = System.nanoTime();
        Pattern spacePattern = Pattern.compile(" ");
        for (int r = 0; r < runs; r++) {
            List<String> list = Arrays.asList(spacePattern.split(sample, 0));
        }
        long time = System.nanoTime() - start;
        System.out.printf("Pattern.split took an average of %.1f us%n", time / runs / 1000.0);
    }
    {
        long start = System.nanoTime();
        for (int r = 0; r < runs; r++) {
            List<String> list = new ArrayList<String>();
            int pos = 0, end;
            while ((end = sample.indexOf(' ', pos)) >= 0) {
                list.add(sample.substring(pos, end));
                pos = end + 1;
            }
        }
        long time = System.nanoTime() - start;
        System.out.printf("indexOf loop took an average of %.1f us%n", time / runs / 1000.0);
    }
}

prints

StringTokenizer took an average of 5.8 us
Pattern.split took an average of 4.8 us
indexOf loop took an average of 1.8 us
StringTokenizer took an average of 4.9 us
Pattern.split took an average of 3.7 us
indexOf loop took an average of 1.7 us
StringTokenizer took an average of 5.2 us
Pattern.split took an average of 3.9 us
indexOf loop took an average of 1.8 us
StringTokenizer took an average of 5.1 us
Pattern.split took an average of 4.1 us
indexOf loop took an average of 1.6 us
StringTokenizer took an average of 5.0 us
Pattern.split took an average of 3.8 us
indexOf loop took an average of 1.6 us
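Note that the indexOf loop in the benchmark relies on the sample ending with a space; on a string without a trailing delimiter it would drop the last word. A minimal sketch of a reusable helper follows. It assumes a single-character delimiter and that empty tokens should be skipped (which is what StringTokenizer does). The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class IndexOfSplitter {

    /** Splits s on a single delimiter character using repeated indexOf calls. */
    public static List<String> split(String s, char delim) {
        List<String> tokens = new ArrayList<>();
        int pos = 0, end;
        while ((end = s.indexOf(delim, pos)) >= 0) {
            if (end > pos) {
                tokens.add(s.substring(pos, end)); // non-empty token before the delimiter
            }
            pos = end + 1; // continue just past the delimiter
        }
        if (pos < s.length()) {
            tokens.add(s.substring(pos)); // final token when there is no trailing delimiter
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("the quick  brown fox", ' ')); // prints [the, quick, brown, fox]
    }
}
```

If empty tokens matter for your data, drop the `end > pos` check; the results then match the benchmark loop more closely.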

The cost of opening a file will be about 8 ms. As the files are so small, your cache may improve performance by a factor of 2-5x. Even so, it's going to spend ~10 hours opening files. The cost of using split vs. StringTokenizer is far less than 0.01 ms per document. Parsing 19 million documents x 30 words x 8 letters per word (about 4.6 GB) should take about 10 seconds, at about 1 GB per 2 seconds.
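The arithmetic behind those figures can be checked with a back-of-the-envelope sketch. The inputs are the rough numbers quoted above (8 ms per open, a 2-5x cache speedup, 1 GB per 2 seconds) and one letter is counted as one byte, which is an assumption. The class and method names are hypothetical.

```java
public class CostEstimate {

    static final long DOCS = 19_000_000L; // number of documents, from the question

    /** Hours spent opening one file per document. */
    public static double fileOpenHours(double msPerOpen, double cacheSpeedup) {
        double totalSeconds = DOCS * msPerOpen / cacheSpeedup / 1000.0;
        return totalSeconds / 3600.0;
    }

    /** Seconds spent parsing all documents at a given throughput. */
    public static double parseSeconds(int wordsPerDoc, int bytesPerWord, double secondsPerGb) {
        double gigabytes = DOCS * (double) wordsPerDoc * bytesPerWord / 1e9;
        return gigabytes * secondsPerGb;
    }

    public static void main(String[] args) {
        System.out.printf("open, no cache:  %.1f h%n", fileOpenHours(8.0, 1.0));
        System.out.printf("open, 2x cache:  %.1f h%n", fileOpenHours(8.0, 2.0));
        System.out.printf("open, 5x cache:  %.1f h%n", fileOpenHours(8.0, 5.0));
        System.out.printf("parse all words: %.1f s%n", parseSeconds(30, 8, 2.0));
    }
}
```

With these inputs, opening files comes to about 42 hours uncached and about 8 to 21 hours with the cache, against roughly 9 seconds of parsing, so the tokenizing method is a rounding error next to the I/O.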

If you want to improve performance, I suggest you have far fewer files, e.g. by using a database. If you don't want to use an SQL database, I suggest using one of these: http://nosql-database.org/

Peter Lawrey answered Sep 28 '22 11:09


In Java 7, String.split just calls indexOf for this input (a single-character delimiter that is not a regex metacharacter); see the source. So split should be very fast, close to repeated calls of indexOf.
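If the speed difference is that small, the choice may come down to behavior rather than performance. The sketch below compares what the two approaches from the question return; the class and method names are made up for illustration. The two agree on single-space input but differ on consecutive spaces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

public class SplitVsTokenizer {

    /** Tokens as returned by String.split with a single-space delimiter. */
    public static List<String> viaSplit(String s) {
        return Arrays.asList(s.split(" "));
    }

    /** Tokens as returned by StringTokenizer with a single-space delimiter. */
    public static List<String> viaTokenizer(String s) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(s, " ");
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(viaSplit("a b c"));            // prints [a, b, c]
        System.out.println(viaTokenizer("a b c"));        // prints [a, b, c]
        System.out.println(viaSplit("a  b").size());      // prints 3: split keeps the empty string between the two spaces
        System.out.println(viaTokenizer("a  b").size());  // prints 2: StringTokenizer skips empty tokens
    }
}
```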

nes1983 answered Sep 28 '22 09:09