
How to split text with a regex while keeping the separator in each resulting word?

I have a text and I am using this simple regex to split it into words: [ \n]. It splits the text on spaces and line breaks.

I want to know if there is a way to keep the space or line break in each split word, because I will use it for simple sentence detection after some processing.

I'm using the String#split method.
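
For reference, here is a minimal sketch of the behaviour described above, assuming a short sample string. The class name SplitDropsSeparators and the helper splitWords are made up for illustration and are not from the original code. It shows that String#split with [ \n] consumes the separators, so they are missing from the resulting words.

```java
public class SplitDropsSeparators {
    // Hypothetical helper: splits on a single space or line break.
    // The matched separator is consumed and does not appear in any piece.
    static String[] splitWords(String text) {
        return text.split("[ \n]");
    }

    public static void main(String[] args) {
        String text = "one two\nthree"; // assumed sample input
        for (String word : splitWords(text)) {
            // Brackets show that no whitespace is left in the words.
            System.out.println("[" + word + "]");
        }
    }
}
```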

Renato Dinhani asked Aug 17 '11 16:08



1 Answer

You can use lookbehind as @Piotr Findeisen suggested (+1):

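public class RegexExample {
    public static void main(String[] args) {
        String s = "firstWordWithSpaceAfter secondWordWithSpaceAfter wordWithLineBreakAfter\nlastWord";
        // Split at every position preceded by a space or a line break.
        // The lookbehind is zero-width, so the separator stays in the piece.
        String[] parts = s.split("(?<=[ \\n])");
        for (String part : parts) {
            // Brackets make the trailing whitespace visible.
            System.out.println("[" + part + "]");
        }
    }
}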

Output:

[firstWordWithSpaceAfter ]
[secondWordWithSpaceAfter ]
[wordWithLineBreakAfter
]
[lastWord]

Short explanation:

(?<=X) is a lookbehind: it matches at a position only if the text immediately before that position matches X (in this case [ \\n]). It is zero-width, so it consumes no characters.

[ \\n] is a character class that matches a single character, either a space or a line break. The backslash is doubled because the regex is written inside a Java string literal.

So the whole regex says: split at every position where the preceding character is either a space or \n.

Since the space or \n is only checked by the lookbehind and is not part of the match itself, split does not remove it.
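
One way to check that nothing is removed is to join the pieces back together and compare the result with the input. The sketch below assumes a short sample string. The class name KeepSeparatorsCheck and the helper splitKeepingSeparators are hypothetical names used only for this illustration.

```java
public class KeepSeparatorsCheck {
    // Hypothetical helper: splits after each space or line break
    // using the zero-width lookbehind from the answer.
    static String[] splitKeepingSeparators(String text) {
        return text.split("(?<=[ \\n])");
    }

    public static void main(String[] args) {
        String text = "one two\nthree"; // assumed sample input
        String[] parts = splitKeepingSeparators(text);

        // Concatenating the pieces with no delimiter should give back
        // the input, because the split consumed no characters.
        String rejoined = String.join("", parts);
        System.out.println(rejoined.equals(text));
    }
}
```

Because each piece still ends with its own separator, a later sentence-detection step can inspect the last character of a piece to see whether a space or a line break followed the word.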

MByD answered Oct 23 '22 22:10
