
Split String with regex \w \w*? \w+?

Tags: java, regex

I'm learning regular expressions and thought I was starting to get a grip, but then...

I tried to split a string and need help understanding something as simple as this:

String input = "abcde";
System.out.println("[a-z] " + Arrays.toString(input.split("[a-z]")));
System.out.println("\\w " + Arrays.toString(input.split("\\w")));
System.out.println("\\w*? " + Arrays.toString(input.split("\\w*?")));
System.out.println("\\w+? " + Arrays.toString(input.split("\\w+?")));

The output is
[a-z] - []
\w    - []
\w*?  - [, a, b, c, d, e]
\w+?  - []

Why doesn't either of the first two lines split the String on every character? The third expression, \w*? (the question mark prevents greediness), works as I expected, splitting the String on every character. The fourth expression, \w+? (one or more matches, lazy), returns an empty array.

I've tried the expression \w in Notepad++ and in a program, and both show 5 matches, as in:

Scanner ls = new Scanner(input);
while(ls.hasNext())
    System.out.format("%s ", ls.findInLine("\\w"));

Output is: a b c d e

This really puzzles me.

Kennet asked Mar 18 '12 18:03



1 Answer

If you split a string with a regex, you are telling Java where the string should be cut, and whatever the regex matches is cut away. So if you split at \w, every character is a split point and only the substrings between the characters (all of them empty) are returned. Java automatically removes trailing empty strings, as described in the documentation for String.split.

This also explains why the lazy \w*? gives you every character: it produces a zero-width match at every position between (and before and after) the characters, so nothing is cut away. What's left are the characters of the string themselves.
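The removal of trailing empty strings can be made visible with the two-argument String.split(regex, limit): a negative limit keeps every piece, including trailing empty ones. The following is a minimal sketch; the class name SplitLimitDemo and its helper methods are made up for illustration.

```java
import java.util.Arrays;

public class SplitLimitDemo {
    // split(regex) behaves like split(regex, 0): trailing empty strings are dropped.
    static String[] splitDefault(String input, String regex) {
        return input.split(regex);
    }

    // A negative limit keeps every piece, including trailing empty strings.
    static String[] splitKeepAll(String input, String regex) {
        return input.split(regex, -1);
    }

    public static void main(String[] args) {
        String input = "abcde";
        String[] dropped = splitDefault(input, "\\w");
        String[] kept = splitKeepAll(input, "\\w");
        // 0 pieces survive with the default limit, 6 empty pieces with limit -1
        System.out.println(dropped.length + " " + Arrays.toString(dropped));
        System.out.println(kept.length + " " + Arrays.toString(kept));
    }
}
```

Five one-character matches leave six empty pieces around them. With the default limit all six count as trailing empty strings and are removed, which is the [] from the question.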

Let's break it down:

  1. [a-z], \w, \w+?

    Your string is

    abcde
    

    And the matches are as follows:

     a  b  c  d  e
    └─┘└─┘└─┘└─┘└─┘
    

    which leaves you with the substrings between the matches, all of which are empty.

    The above three regexes behave the same in this regard as they all will only match a single character. \w+? will do so because it lacks any other constraints that might make the +? try matching more than just the bare minimum (it's lazy, after all).

  2. \w*?

      a  b  c  d  e
    └┘ └┘ └┘ └┘ └┘ └┘
    

    In this case the matches are between the characters, leaving you with the following substrings:

    "", "a", "b", "c", "d", "e", ""
    

    Java throws the trailing empty one away, though.
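The match positions drawn in the two diagrams can be checked with java.util.regex.Matcher, which reports where each match starts and ends. This is only a sketch for verification; the class name MatchSpans and the spans helper are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchSpans {
    // Collects "start-end" for every match of regex in input.
    // A zero-width match has start == end.
    static List<String> spans(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.start() + "-" + m.end());
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "abcde";
        // The first three consume one character per match;
        // \w*? matches the empty string at each of the six positions.
        for (String regex : new String[] {"[a-z]", "\\w", "\\w+?", "\\w*?"}) {
            System.out.println(regex + " " + spans(regex, input));
        }
    }
}
```

For [a-z], \w and \w+? this reports the five spans 0-1 through 4-5, and for \w*? the six zero-width spans 0-0 through 5-5. Note that on Java 8 and later, String.split no longer produces a leading empty string for a zero-width match at the start of the input, so the \w*? split prints [a, b, c, d, e] there instead of [, a, b, c, d, e].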

Joey answered Oct 21 '22 23:10