Using streams to manipulate a String

Let's say that I want to remove all the non-letters from my String.

String s = "abc-de3-2fg"; 

I can use an IntStream in order to do that:

s.chars().filter(ch -> Character.isLetter(ch)).  // But then what?

What can I do in order to convert this stream back to a String instance?

On a different note, why can't I treat a String as a stream of objects of type Character?

String s = "abc-de3-2fg";
// Yields a Stream of char[], therefore doesn't compile
Stream<Character> stream = Stream.of(s.toCharArray());
// Yields a stream with one member - s, which is a String object. Doesn't compile
Stream<Character> stream = Stream.of(s);

According to the javadoc, the Stream's creation signature is as follows:

Stream.of(T... values)

The only (lousy) way that I could think of is:

String s = "abc-de3-2fg";
Stream<Character> stream = Stream.of(s.charAt(0), s.charAt(1), s.charAt(2), ...)

And of course, this isn't good enough... What am I missing?

KidCrippler asked Aug 12 '15 23:08

1 Answer

Here's an answer to the second part of the question. If you have an IntStream resulting from calling string.chars(), you can get a Stream<Character> by casting to char and then boxing the result by calling mapToObj. For example, here's how to turn a String into a Set<Character>:

Set<Character> set = string.chars()
    .mapToObj(ch -> (char)ch)
    .collect(Collectors.toSet());

Note that casting to char is essential for the boxed result to be Character instead of Integer.
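To make the effect of the cast concrete, here is a minimal sketch. The class and method names (CharBoxingDemo, toCharSet, toIntSet) are made up for illustration, and a TreeSet is used only so the printed order is predictable.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

final class CharBoxingDemo {
    // With the (char) cast, each value is boxed as a Character.
    static Set<Character> toCharSet(String s) {
        return s.chars()
                .mapToObj(ch -> (char) ch)
                .collect(Collectors.toCollection(TreeSet::new));
    }

    // Without a cast, boxing the IntStream yields Integer values instead.
    static Set<Integer> toIntSet(String s) {
        return s.chars()
                .boxed()
                .collect(Collectors.toCollection(TreeSet::new));
    }

    public static void main(String[] args) {
        System.out.println(toCharSet("abca")); // [a, b, c]
        System.out.println(toIntSet("abca"));  // [97, 98, 99]
    }
}
```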

Now the big problem with dealing with char or Character data is that supplementary characters are represented as surrogate pairs of char values, so any algorithm that deals with individual char values will probably fail when presented with supplementary characters.

(It may seem like supplementary characters are an obscure Unicode feature that we don't need to worry about, but as far as I know, most emoji are supplementary characters.)

Consider this example:

string.chars()
      .filter(Character::isAlphabetic)
      ...

This will fail if presented with a string that contains the code point U+1D400 (Mathematical Bold Capital A). That code point is represented as a surrogate pair in the string, and neither value of a surrogate pair is an alphabetic character. To get the correct result, you'd need to do this instead:

string.codePoints()
      .filter(Character::isAlphabetic)
      ...
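As a rough illustration of the difference, the sketch below counts alphabetic elements both ways in a string containing U+1D400. The class and method names are hypothetical, and counting is just one convenient way to make the mismatch visible.

```java
final class SurrogateDemo {
    // Looks at UTF-16 code units one by one; surrogate halves are not alphabetic.
    static long countAlphabeticChars(String s) {
        return s.chars().filter(Character::isAlphabetic).count();
    }

    // Looks at whole code points, so a surrogate pair is seen as one character.
    static long countAlphabeticCodePoints(String s) {
        return s.codePoints().filter(Character::isAlphabetic).count();
    }

    public static void main(String[] args) {
        // "x", then U+1D400 (Mathematical Bold Capital A), then "y"
        String s = "x" + new String(Character.toChars(0x1D400)) + "y";
        System.out.println(countAlphabeticChars(s));      // 2
        System.out.println(countAlphabeticCodePoints(s)); // 3
    }
}
```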

I recommend always using codePoints().

Now, given an IntStream of code points, how can one reassemble it into a String? Sleiman Jneidi's answer is a reasonable one (+1), using the three-arg collect() method of IntStream.
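The referenced answer isn't reproduced here, so the following is only a sketch of what a three-arg collect() on an IntStream of code points typically looks like, applied to the "remove non-letters" example from the question. The class and method names are invented for illustration.

```java
final class CollectDemo {
    // Keeps only letters and reassembles the code points into a String.
    static String lettersOnly(String s) {
        return s.codePoints()
                .filter(Character::isLetter)
                .collect(StringBuilder::new,              // supplier: a fresh builder
                         StringBuilder::appendCodePoint,  // accumulator: add one code point
                         StringBuilder::append)           // combiner: merge partial builders
                .toString();
    }

    public static void main(String[] args) {
        System.out.println(lettersOnly("abc-de3-2fg")); // abcdefg
    }
}
```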

Here's an alternative:

StringBuilder sb = ... ;
string.codePoints()
      .filter(...)
      .forEachOrdered(sb::appendCodePoint);
return sb.toString();

This might be a bit more flexible, in cases where you already have a StringBuilder that you're using to accumulate string data. You don't have to create a new StringBuilder each time, nor do you have to convert it to a String afterwards.
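For instance, a filled-in version of that pattern might look like the sketch below, where one StringBuilder accumulates output from several pipelines. The helper name appendLetters, the letter filter, and the sample strings are assumptions chosen to match the question's example.

```java
final class AppendDemo {
    // Appends only the letters of 'input' to a builder the caller already owns.
    static void appendLetters(StringBuilder sb, String input) {
        input.codePoints()
             .filter(Character::isLetter)
             .forEachOrdered(sb::appendCodePoint);
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder("letters: ");
        appendLetters(sb, "abc-de3-2fg");
        appendLetters(sb, "-hi!");
        System.out.println(sb); // letters: abcdefghi
    }
}
```

forEachOrdered is used rather than forEach so the encounter order is preserved even if the stream were made parallel.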

Stuart Marks answered Oct 25 '22 00:10