
Java split is eating my characters

I have a string like this: String str = "la$le\\$li$lo".

I want to split it to get the following output: "la","le\\$li","lo". The \$ is an escaped $, so it should be left in the output.

But when I do str.split("[^\\\\]\\$") I get "l","le\\$l","lo".

From what I can tell, my regex is matching a$ and i$ and removing them. Any idea how to get my characters back?

Thanks

Fenris_uy asked May 12 '10 14:05



3 Answers

Use zero-width matching assertions:

    String str = "la$le\\$li$lo";
    System.out.println(java.util.Arrays.toString(
        str.split("(?<!\\\\)\\$")
    )); // prints "[la, le\$li, lo]"

The regex is essentially

(?<!\\)\$

It uses negative lookbehind to assert that there is not a preceding \.
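If the same split is needed in several places, the lookbehind pattern can be compiled once and reused. The following is a minimal sketch of that idea, not part of the original answer; the class name EscapedSplit and the helper splitOnUnescapedDollar are hypothetical names chosen for illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class EscapedSplit {
    // Regex (?<!\\)\$ : a '$' that is not immediately preceded by a backslash.
    // Compiled once so repeated calls do not recompile the pattern.
    private static final Pattern UNESCAPED_DOLLAR = Pattern.compile("(?<!\\\\)\\$");

    // Hypothetical helper: splits on unescaped '$' and leaves "\$" inside the parts.
    static String[] splitOnUnescapedDollar(String input) {
        return UNESCAPED_DOLLAR.split(input);
    }

    public static void main(String[] args) {
        String str = "la$le\\$li$lo"; // runtime value: la$le\$li$lo
        System.out.println(Arrays.toString(splitOnUnescapedDollar(str)));
        // prints "[la, le\$li, lo]"
    }
}
```

Because the lookbehind is zero-width, only the $ itself is consumed as the delimiter, so the character before it stays in the output.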

See also

  • regular-expressions.info/Lookarounds

More examples of splitting on assertions

Simple sentence splitting, keeping punctuation marks:

    String str = "Really?Wow!This.Is.Awesome!";
    System.out.println(java.util.Arrays.toString(
        str.split("(?<=[.!?])")
    )); // prints "[Really?, Wow!, This., Is., Awesome!]"

Splitting a long string into fixed-length parts, using \G:

    String str = "012345678901234567890";
    System.out.println(java.util.Arrays.toString(
        str.split("(?<=\\G.{4})")
    )); // prints "[0123, 4567, 8901, 2345, 6789, 0]"

Using a lookbehind/lookahead combo:

    String str = "HelloThereHowAreYou";
    System.out.println(java.util.Arrays.toString(
        str.split("(?<=[a-z])(?=[A-Z])")
    )); // prints "[Hello, There, How, Are, You]"

Related questions

  • Can you use zero-width matching regex in String split?
  • Backreferences in lookbehind
  • How do I convert CamelCase into human-readable names in Java?
polygenelubricants answered Oct 23 '22 12:10


The reason a$ and i$ are getting removed is that the regexp [^\\]\$ matches any character that is not '\' followed by '$', and split discards everything the regexp matches, including that preceding character. You need to use a zero-width assertion instead.

This is the same problem people have trying to find q not followed by u.

A first cut at the proper regexp is /(?<!\\)\$/ ( "(?<!\\\\)\\$" in java )

    class Test {
        public static void main(String[] args) {
            // A '$' that is not immediately preceded by a backslash
            String regexp = "(?<!\\\\)\\$";
            System.out.println(java.util.Arrays.toString("la$le\\$li$lo".split(regexp)));
        }
    }

Yields:
[la, le\$li, lo]
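
To see why the character class loses characters while the lookbehind does not, it can help to print exactly what each regexp consumes, since split throws that text away. This is a small illustrative sketch; the class name ConsumedVsZeroWidth and the helper matchedText are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ConsumedVsZeroWidth {
    // Collects the exact text of every match; split() discards this text.
    static List<String> matchedText(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String str = "la$le\\$li$lo"; // runtime value: la$le\$li$lo
        // The character class consumes the character before each '$'.
        System.out.println(matchedText("[^\\\\]\\$", str));   // prints "[a$, i$]"
        // The lookbehind only inspects that character, so just '$' is consumed.
        System.out.println(matchedText("(?<!\\\\)\\$", str)); // prints "[$, $]"
    }
}
```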

KitsuneYMG answered Oct 23 '22 10:10


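You can try first replacing "\$" with another string, such as the URL encoding for $ ("%24"), and then splitting on a plain $. Once the escaped occurrences are gone there is nothing left to exclude, so the character class is no longer needed. This assumes the input does not already contain "%24".

    // String.replace works on literal text, so "\\$" here means the two characters \$
    String[] splits = str.replace("\\$", "%24").split("\\$");
    for (int i = 0; i < splits.length; i++) {
        // Write back into the array; reassigning a for-each variable would have no effect
        splits[i] = splits[i].replace("%24", "\\$");
    }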

More generally, if str is constructed by something like

str = a + "$" + b + "$" + c

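Then you can URL-encode a, b and c before appending them together:

    import static java.net.URLEncoder.encode;
    ...
    // encode(String, String) throws the checked UnsupportedEncodingException;
    // any "$" inside a, b or c becomes "%24"
    str = encode(a, "UTF-8") + "$" + encode(b, "UTF-8") + "$" + encode(c, "UTF-8");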
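
A possible round trip for this approach is sketched below: each part is encoded before joining and decoded again after splitting. It is only an illustration under the assumption that you control both the producing and the consuming side; the class name EncodedJoin and its methods join and split are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;

class EncodedJoin {
    // Percent-encodes each part (so '$' becomes "%24"), then joins with '$'.
    static String join(List<String> parts) {
        List<String> encoded = new ArrayList<>();
        try {
            for (String p : parts) {
                encoded.add(URLEncoder.encode(p, "UTF-8"));
            }
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
        return String.join("$", encoded);
    }

    // Splits on every '$' (none can belong to a part) and decodes each piece.
    static List<String> split(String joined) {
        List<String> parts = new ArrayList<>();
        try {
            for (String p : joined.split("\\$", -1)) {
                parts.add(URLDecoder.decode(p, "UTF-8"));
            }
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
        return parts;
    }

    public static void main(String[] args) {
        List<String> original = new ArrayList<>();
        original.add("la");
        original.add("le$li");
        original.add("lo");
        String joined = join(original);
        System.out.println(joined);        // prints "la$le%24li$lo"
        System.out.println(split(joined)); // prints "[la, le$li, lo]"
    }
}
```

The limit of -1 passed to split keeps trailing empty parts, which the default split would silently drop.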
James Kingsbery answered Oct 23 '22 11:10