In my software I need to split a string into words. I currently have more than 19,000,000 documents with more than 30 words each.
Which of the following two approaches is better in terms of performance?
StringTokenizer sTokenize = new StringTokenizer(s, " ");
while (sTokenize.hasMoreTokens()) {
    String word = sTokenize.nextToken(); // process each word
}
or
String[] splitS = s.split(" ");
for (int i = 0; i < splitS.length; i++) {
    String word = splitS[i]; // process each word
}
When you run the comparison multiple times, the indexOf/substring approach wins hands down: 1,000,000 iterations of split take 3.36 s, while 1,000,000 iterations of the substring loop take only 0.05 s.
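For reference, the substring approach being timed is essentially an indexOf/substring loop like the following minimal sketch (the class and method names are my own, and it assumes words are separated by single spaces):

import java.util.ArrayList;
import java.util.List;

public class SpaceSplitter {
    // Walks the string with indexOf and cuts out each word with substring.
    public static List<String> splitOnSpaces(String s) {
        List<String> words = new ArrayList<String>();
        int pos = 0, end;
        while ((end = s.indexOf(' ', pos)) >= 0) {
            words.add(s.substring(pos, end));
            pos = end + 1;
        }
        words.add(s.substring(pos)); // text after the last space
        return words;
    }
}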
StringTokenizer is a legacy class (i.e. there is a better replacement out there), but it's not deprecated.
The StringTokenizer class lets you break a String into tokens. It is a simple way to split a String, but it is a legacy class, and it provides no facility to distinguish numbers, quoted strings, identifiers, and so on.
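A minimal usage example, patterned after the one in the StringTokenizer Javadoc:

import java.util.StringTokenizer;

public class TokenizerDemo {
    public static void main(String[] args) {
        // Default delimiters are space, tab, newline, carriage return, and form feed.
        StringTokenizer st = new StringTokenizer("this is a test");
        while (st.hasMoreTokens()) {
            System.out.println(st.nextToken());
        }
    }
}

This prints each word on its own line: this, is, a, test.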
If your data is already in a database and you need to parse the string of words, I would suggest using indexOf repeatedly. It's many times faster than either solution.
However, getting the data from the database is still likely to be much more expensive than the parsing itself.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;
import java.util.regex.Pattern;

// Build a sample of 60 space-separated six-digit "words".
StringBuilder sb = new StringBuilder();
for (int i = 100000; i < 100000 + 60; i++)
    sb.append(i).append(' ');
String sample = sb.toString();

int runs = 100000;
for (int i = 0; i < 5; i++) {
    {
        long start = System.nanoTime();
        for (int r = 0; r < runs; r++) {
            StringTokenizer st = new StringTokenizer(sample);
            List<String> list = new ArrayList<String>();
            while (st.hasMoreTokens())
                list.add(st.nextToken());
        }
        long time = System.nanoTime() - start;
        System.out.printf("StringTokenizer took an average of %.1f us%n", time / runs / 1000.0);
    }
    {
        long start = System.nanoTime();
        Pattern spacePattern = Pattern.compile(" ");
        for (int r = 0; r < runs; r++) {
            List<String> list = Arrays.asList(spacePattern.split(sample, 0));
        }
        long time = System.nanoTime() - start;
        System.out.printf("Pattern.split took an average of %.1f us%n", time / runs / 1000.0);
    }
    {
        long start = System.nanoTime();
        for (int r = 0; r < runs; r++) {
            List<String> list = new ArrayList<String>();
            int pos = 0, end;
            while ((end = sample.indexOf(' ', pos)) >= 0) {
                list.add(sample.substring(pos, end));
                pos = end + 1;
            }
        }
        long time = System.nanoTime() - start;
        System.out.printf("indexOf loop took an average of %.1f us%n", time / runs / 1000.0);
    }
}
prints
StringTokenizer took an average of 5.8 us
Pattern.split took an average of 4.8 us
indexOf loop took an average of 1.8 us
StringTokenizer took an average of 4.9 us
Pattern.split took an average of 3.7 us
indexOf loop took an average of 1.7 us
StringTokenizer took an average of 5.2 us
Pattern.split took an average of 3.9 us
indexOf loop took an average of 1.8 us
StringTokenizer took an average of 5.1 us
Pattern.split took an average of 4.1 us
indexOf loop took an average of 1.6 us
StringTokenizer took an average of 5.0 us
Pattern.split took an average of 3.8 us
indexOf loop took an average of 1.6 us
The cost of opening a file will be about 8 ms; with 19 million files, that comes to roughly 42 hours of open() calls alone. As the files are so small, your cache may improve performance by a factor of 2-5x, but even so it's going to spend ~10 hours just opening files. By contrast, the cost of using split vs StringTokenizer is far less than 0.01 ms each: parsing 19 million x 30 words x 8 letters per word (about 4.5 GB of text) should take about 10 seconds at roughly 1 GB per 2 seconds.
If you want to improve performance, I suggest you have far fewer files, e.g. use a database. If you don't want to use an SQL database, I suggest using one of these: http://nosql-database.org/
Split in Java 7 just calls indexOf for this input: a single-character, non-metacharacter delimiter takes a fast path that skips the regex engine entirely (see the source). So split should be very fast, close to repeated calls of indexOf.
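As a quick illustration (my own sketch, not from the linked source), a single-space split produces the same tokens as the indexOf loop without any visible regex machinery:

import java.util.Arrays;

public class SplitDemo {
    public static void main(String[] args) {
        String s = "the quick brown fox";
        // With a one-character, non-metacharacter delimiter like " ",
        // String.split in Java 7 and later scans with indexOf internally
        // instead of compiling a Pattern.
        System.out.println(Arrays.toString(s.split(" "))); // [the, quick, brown, fox]
    }
}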