What is the fastest way to strip all non-printable characters from a <code>String</code> in Java? So far I've tried and measured on 138-byte, 131-character String: <ul> <li>String's <code>replaceAll()</code> - slowest method <ul> <li>517009 results / sec</li> </ul> </li> <li>Precompile a Pattern, then use Matcher's <code>replaceAll()</code> <ul> <li>637836 results / sec</li> </ul> </li> <li>Use StringBuffer, get codepoints using <code>codepointAt()</code> one-by-one and append to StringBuffer <ul> <li>711946 results / sec</li> </ul> </li> <li>Use StringBuffer, get chars using <code>charAt()</code> one-by-one and append to StringBuffer <ul> <li>1052964 results / sec</li> </ul> </li> <li>Preallocate a <code>char[]</code> buffer, get chars using <code>charAt()</code> one-by-one and fill this buffer, then convert back to String <ul> <li>2022653 results / sec</li> </ul> </li> <li>Preallocate 2 <code>char[]</code> buffers - old and new, get all chars for existing String at once using <code>getChars()</code>, iterate over old buffer one-by-one and fill new buffer, then convert new buffer to String - my own fastest version <ul> <li>2502502 results / sec</li> </ul> </li> <li>Same stuff with 2 buffers - only using <code>byte[]</code>, <code>getBytes()</code> and specifying encoding as "utf-8" <ul> <li>857485 results / sec</li> </ul> </li> <li>Same stuff with 2 <code>byte[]</code> buffers, but specifying encoding as a constant <code>Charset.forName("utf-8")</code> <ul> <li>791076 results / sec</li> </ul> </li> <li>Same stuff with 2 <code>byte[]</code> buffers, but specifying encoding as 1-byte local encoding (barely a sane thing to do) <ul> <li>370164 results / sec</li> </ul> </li> </ul> My best try was the following: <pre class="prettyprint"><code> char[] oldChars = new char[s.length()]; s.getChars(0, s.length(), oldChars, 0); char[] newChars = new char[s.length()]; int newLen = 0; for (int j = 0; j < s.length(); j++) { char ch = oldChars[j]; if (ch >= ' ') { newChars[newLen] = ch; newLen++; } } s = new String(newChars, 0, newLen); </code></pre> Any thoughts on how to make it even faster? Bonus points for answering a very strange question: why using "utf-8" charset name directly yields better performance than using pre-allocated static const <code>Charset.forName("utf-8")</code>? <h3>Update</h3> <ul> <li>Suggestion from ratchet freak yields impressive 3105590 results / sec performance, a +24% improvement!</li> <li>Suggestion from Ed Staub yields yet another improvement - 3471017 results / sec, a +12% over previous best.</li> </ul> <h3>Update 2</h3> I've tried my best to collected all the proposed solutions and its cross-mutations and published it as a small benchmarking framework at github. Currently it sports 17 algorithms. One of them is "special" - Voo1 algorithm (provided by SO user Voo) employs intricate reflection tricks thus achieving stellar speeds, but it messes up JVM strings' state, thus it's benchmarked separately. You're welcome to check it out and run it to determine results on your box. Here's a summary of results I've got on mine. It's specs: <ul> <li>Debian sid</li> <li>Linux 2.6.39-2-amd64 (x86_64)</li> <li>Java installed from a package <code>sun-java6-jdk-6.24-1</code>, JVM identifies itself as <ul> <li>Java(TM) SE Runtime Environment (build 1.6.0_24-b07)</li> <li>Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)</li> </ul> </li> </ul> Different algorithms show ultimately different results given a different set of input data. I've ran a benchmark in 3 modes: <h3>Same single string</h3> This mode works on a same single string provided by <code>StringSource</code> class as a constant. The showdown is: <pre class="prettyprint"> Ops / s │ Algorithm ──────────┼────────────────────────────── 6 535 947 │ Voo1 ──────────┼────────────────────────────── 5 350 454 │ RatchetFreak2EdStaub1GreyCat1 5 249 343 │ EdStaub1 5 002 501 │ EdStaub1GreyCat1 4 859 086 │ ArrayOfCharFromStringCharAt 4 295 532 │ RatchetFreak1 4 045 307 │ ArrayOfCharFromArrayOfChar 2 790 178 │ RatchetFreak2EdStaub1GreyCat2 2 583 311 │ RatchetFreak2 1 274 859 │ StringBuilderChar 1 138 174 │ StringBuilderCodePoint 994 727 │ ArrayOfByteUTF8String 918 611 │ ArrayOfByteUTF8Const 756 086 │ MatcherReplace 598 945 │ StringReplaceAll 460 045 │ ArrayOfByteWindows1251 </pre> In charted form: <img src="https://i.stack.imgur.com/5n8k5.png" alt="Same single string chart"> (source: greycat.ru) <h3>Multiple strings, 100% of strings contain control characters</h3> Source string provider pre-generated lots of random strings using (0..127) character set - thus almost all strings contained at least one control character. Algorithms received strings from this pre-generated array in round-robin fashion. <pre class="prettyprint"> Ops / s │ Algorithm ──────────┼────────────────────────────── 2 123 142 │ Voo1 ──────────┼────────────────────────────── 1 782 214 │ EdStaub1 1 776 199 │ EdStaub1GreyCat1 1 694 628 │ ArrayOfCharFromStringCharAt 1 481 481 │ ArrayOfCharFromArrayOfChar 1 460 067 │ RatchetFreak2EdStaub1GreyCat1 1 438 435 │ RatchetFreak2EdStaub1GreyCat2 1 366 494 │ RatchetFreak2 1 349 710 │ RatchetFreak1 893 176 │ ArrayOfByteUTF8String 817 127 │ ArrayOfByteUTF8Const 778 089 │ StringBuilderChar 734 754 │ StringBuilderCodePoint 377 829 │ ArrayOfByteWindows1251 224 140 │ MatcherReplace 211 104 │ StringReplaceAll </pre> In charted form: <img src="https://i.stack.imgur.com/JogeX.png" alt="Multiple strings, 100% concentration"> (source: greycat.ru) <h3>Multiple strings, 1% of strings contain control characters</h3> Same as previous, but only 1% of strings was generated with control characters - other 99% was generated in using [32..127] character set, so they couldn't contain control characters at all. This synthetic load comes the closest to real world application of this algorithm at my place. <pre class="prettyprint"> Ops / s │ Algorithm ──────────┼────────────────────────────── 3 711 952 │ Voo1 ──────────┼────────────────────────────── 2 851 440 │ EdStaub1GreyCat1 2 455 796 │ EdStaub1 2 426 007 │ ArrayOfCharFromStringCharAt 2 347 969 │ RatchetFreak2EdStaub1GreyCat2 2 242 152 │ RatchetFreak1 2 171 553 │ ArrayOfCharFromArrayOfChar 1 922 707 │ RatchetFreak2EdStaub1GreyCat1 1 857 010 │ RatchetFreak2 1 023 751 │ ArrayOfByteUTF8String 939 055 │ StringBuilderChar 907 194 │ ArrayOfByteUTF8Const 841 963 │ StringBuilderCodePoint 606 465 │ MatcherReplace 501 555 │ StringReplaceAll 381 185 │ ArrayOfByteWindows1251 </pre> In charted form: <img src="https://i.stack.imgur.com/TnQAd.png" alt="Multiple strings, 1% concentration"> (source: greycat.ru) It's very hard for me to decide on who provided the best answer, but given the real-world application best solution was given/inspired by Ed Staub, I guess it would be fair to mark his answer. Thanks for all who took part in this, your input was very helpful and invaluable. Feel free to run the test suite on your box and propose even better solutions (working JNI solution, anyone?). <h3>References</h3> <ul> <li> GitHub repository with a benchmarking suite</li> </ul>

using 1 char array could work a bit better <pre class="prettyprint"><code>int length = s.length(); char[] oldChars = new char[length]; s.getChars(0, length, oldChars, 0); int newLen = 0; for (int j = 0; j < length; j++) { char ch = oldChars[j]; if (ch >= ' ') { oldChars[newLen] = ch; newLen++; } } s = new String(oldChars, 0, newLen); </code></pre> and I avoided repeated calls to <code>s.length();</code> another micro-optimization that might work is <pre class="prettyprint"><code>int length = s.length(); char[] oldChars = new char[length+1]; s.getChars(0, length, oldChars, 0); oldChars[length]='\0';//avoiding explicit bound check in while int newLen=-1; while(oldChars[++newLen]>=' ');//find first non-printable, // if there are none it ends on the null char I appended for (int j = newLen; j < length; j++) { char ch = oldChars[j]; if (ch >= ' ') { oldChars[newLen] = ch;//the while avoids repeated overwriting here when newLen==j newLen++; } } s = new String(oldChars, 0, newLen); </code></pre>

If it is reasonable to embed this method in a class which is not shared across threads, then you can reuse the buffer: <pre class="prettyprint"><code>char [] oldChars = new char[5]; String stripControlChars(String s) { final int inputLen = s.length(); if ( oldChars.length < inputLen ) { oldChars = new char[inputLen]; } s.getChars(0, inputLen, oldChars, 0); </code></pre> etc... This is a big win - 20% or so, as I understand the current best case. If this is to be used on potentially large strings and the memory "leak" is a concern, a weak reference can be used.

Fastest way to strip all non-printable characters from a Java String

Tags:

java

performance

string

optimization

micro-optimization

What is the fastest way to strip all non-printable characters from a String in Java?

So far I've tried and measured on 138-byte, 131-character String:

String's replaceAll() - slowest method
- 517009 results / sec
Precompile a Pattern, then use Matcher's replaceAll()
- 637836 results / sec
Use StringBuffer, get codepoints using codepointAt() one-by-one and append to StringBuffer
- 711946 results / sec
Use StringBuffer, get chars using charAt() one-by-one and append to StringBuffer
- 1052964 results / sec
Preallocate a char[] buffer, get chars using charAt() one-by-one and fill this buffer, then convert back to String
- 2022653 results / sec
Preallocate 2 char[] buffers - old and new, get all chars for existing String at once using getChars(), iterate over old buffer one-by-one and fill new buffer, then convert new buffer to String - my own fastest version
- 2502502 results / sec
Same stuff with 2 buffers - only using byte[], getBytes() and specifying encoding as "utf-8"
- 857485 results / sec
Same stuff with 2 byte[] buffers, but specifying encoding as a constant Charset.forName("utf-8")
- 791076 results / sec
Same stuff with 2 byte[] buffers, but specifying encoding as 1-byte local encoding (barely a sane thing to do)
- 370164 results / sec

My best try was the following:

    char[] oldChars = new char[s.length()];     s.getChars(0, s.length(), oldChars, 0);     char[] newChars = new char[s.length()];     int newLen = 0;     for (int j = 0; j < s.length(); j++) {         char ch = oldChars[j];         if (ch >= ' ') {             newChars[newLen] = ch;             newLen++;         }     }     s = new String(newChars, 0, newLen);

Any thoughts on how to make it even faster?

Bonus points for answering a very strange question: why using "utf-8" charset name directly yields better performance than using pre-allocated static const Charset.forName("utf-8")?

Update

Suggestion from ratchet freak yields impressive 3105590 results / sec performance, a +24% improvement!
Suggestion from Ed Staub yields yet another improvement - 3471017 results / sec, a +12% over previous best.

Update 2

I've tried my best to collected all the proposed solutions and its cross-mutations and published it as a small benchmarking framework at github. Currently it sports 17 algorithms. One of them is "special" - Voo1 algorithm (provided by SO user Voo) employs intricate reflection tricks thus achieving stellar speeds, but it messes up JVM strings' state, thus it's benchmarked separately.

You're welcome to check it out and run it to determine results on your box. Here's a summary of results I've got on mine. It's specs:

Debian sid
Linux 2.6.39-2-amd64 (x86_64)
Java installed from a package sun-java6-jdk-6.24-1, JVM identifies itself as
- Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
- Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)

Different algorithms show ultimately different results given a different set of input data. I've ran a benchmark in 3 modes:

Same single string

This mode works on a same single string provided by StringSource class as a constant. The showdown is:

  Ops / s  │ Algorithm ──────────┼────────────────────────────── 6 535 947 │ Voo1 ──────────┼────────────────────────────── 5 350 454 │ RatchetFreak2EdStaub1GreyCat1 5 249 343 │ EdStaub1 5 002 501 │ EdStaub1GreyCat1 4 859 086 │ ArrayOfCharFromStringCharAt 4 295 532 │ RatchetFreak1 4 045 307 │ ArrayOfCharFromArrayOfChar 2 790 178 │ RatchetFreak2EdStaub1GreyCat2 2 583 311 │ RatchetFreak2 1 274 859 │ StringBuilderChar 1 138 174 │ StringBuilderCodePoint   994 727 │ ArrayOfByteUTF8String   918 611 │ ArrayOfByteUTF8Const   756 086 │ MatcherReplace   598 945 │ StringReplaceAll   460 045 │ ArrayOfByteWindows1251

In charted form: Same single string chart
_{(source: greycat.ru)}

Multiple strings, 100% of strings contain control characters

Source string provider pre-generated lots of random strings using (0..127) character set - thus almost all strings contained at least one control character. Algorithms received strings from this pre-generated array in round-robin fashion.

  Ops / s  │ Algorithm ──────────┼────────────────────────────── 2 123 142 │ Voo1 ──────────┼────────────────────────────── 1 782 214 │ EdStaub1 1 776 199 │ EdStaub1GreyCat1 1 694 628 │ ArrayOfCharFromStringCharAt 1 481 481 │ ArrayOfCharFromArrayOfChar 1 460 067 │ RatchetFreak2EdStaub1GreyCat1 1 438 435 │ RatchetFreak2EdStaub1GreyCat2 1 366 494 │ RatchetFreak2 1 349 710 │ RatchetFreak1   893 176 │ ArrayOfByteUTF8String   817 127 │ ArrayOfByteUTF8Const   778 089 │ StringBuilderChar   734 754 │ StringBuilderCodePoint   377 829 │ ArrayOfByteWindows1251   224 140 │ MatcherReplace   211 104 │ StringReplaceAll

In charted form: Multiple strings, 100% concentration
_{(source: greycat.ru)}

Multiple strings, 1% of strings contain control characters

Same as previous, but only 1% of strings was generated with control characters - other 99% was generated in using [32..127] character set, so they couldn't contain control characters at all. This synthetic load comes the closest to real world application of this algorithm at my place.

  Ops / s  │ Algorithm ──────────┼────────────────────────────── 3 711 952 │ Voo1 ──────────┼────────────────────────────── 2 851 440 │ EdStaub1GreyCat1 2 455 796 │ EdStaub1 2 426 007 │ ArrayOfCharFromStringCharAt 2 347 969 │ RatchetFreak2EdStaub1GreyCat2 2 242 152 │ RatchetFreak1 2 171 553 │ ArrayOfCharFromArrayOfChar 1 922 707 │ RatchetFreak2EdStaub1GreyCat1 1 857 010 │ RatchetFreak2 1 023 751 │ ArrayOfByteUTF8String   939 055 │ StringBuilderChar   907 194 │ ArrayOfByteUTF8Const   841 963 │ StringBuilderCodePoint   606 465 │ MatcherReplace   501 555 │ StringReplaceAll   381 185 │ ArrayOfByteWindows1251

In charted form: Multiple strings, 1% concentration
_{(source: greycat.ru)}

It's very hard for me to decide on who provided the best answer, but given the real-world application best solution was given/inspired by Ed Staub, I guess it would be fair to mark his answer. Thanks for all who took part in this, your input was very helpful and invaluable. Feel free to run the test suite on your box and propose even better solutions (working JNI solution, anyone?).

References

GitHub repository with a benchmarking suite

248

asked Aug 23 '11 13:08

GreyCat

2 Answers

using 1 char array could work a bit better

int length = s.length(); char[] oldChars = new char[length]; s.getChars(0, length, oldChars, 0); int newLen = 0; for (int j = 0; j < length; j++) {     char ch = oldChars[j];     if (ch >= ' ') {         oldChars[newLen] = ch;         newLen++;     } } s = new String(oldChars, 0, newLen);

and I avoided repeated calls to s.length();

another micro-optimization that might work is

int length = s.length(); char[] oldChars = new char[length+1]; s.getChars(0, length, oldChars, 0); oldChars[length]='\0';//avoiding explicit bound check in while int newLen=-1; while(oldChars[++newLen]>=' ');//find first non-printable,                        // if there are none it ends on the null char I appended for (int  j = newLen; j < length; j++) {     char ch = oldChars[j];     if (ch >= ' ') {         oldChars[newLen] = ch;//the while avoids repeated overwriting here when newLen==j         newLen++;     } } s = new String(oldChars, 0, newLen);

176

answered Oct 06 '22 10:10

ratchet freak

If it is reasonable to embed this method in a class which is not shared across threads, then you can reuse the buffer:

char [] oldChars = new char[5];  String stripControlChars(String s) {     final int inputLen = s.length();     if ( oldChars.length < inputLen )     {         oldChars = new char[inputLen];     }     s.getChars(0, inputLen, oldChars, 0);

etc...

This is a big win - 20% or so, as I understand the current best case.

If this is to be used on potentially large strings and the memory "leak" is a concern, a weak reference can be used.