I asked an earlier question about punctuation and regex, but it was confusing.
Supposing I have this text:
String text = "wor.d1, :word2. wo,rd3? word4!";
I'm doing this:
String parts[] = text.split(" ");
And I get this:
wor.d1, | :word2. | wo,rd3? | word4!
What do I need to do to get the following instead? (Keep the symbols at the word borders as separate tokens, but only the ones I specify, .,!?: and not all punctuation.)
wor.d1 | , | : | word2 | . | wo,rd3 | ? | word4 | !
I'm getting some good results with this regex, but it produces an empty string before every split where the punctuation is at the start of a word.
Is there a way to avoid this empty string at the start?
Is this regex good, or is there a simpler way?
public static final String PUNCTUATION_SEPARATOR =
    "("
        + "("
            + "(?=^[\"'!?.,;:(){}\\[\\]]+)"
            + "|"
            + "(?<=^[\"'!?.,;:(){}\\[\\]]+)"
        + ")"
        + "|"
        + "("
            + "(?=[\"'!?.,;:(){}\\[\\]]+($|\n))"
            + "|"
            + "(?<=[\"'!?.,;:(){}\\[\\]]+($|\n))"
        + ")"
    + ")";
Delimiters. In languages such as PHP, the first element of a regular expression is its delimiters, which mark where the pattern begins and ends. The most common delimiter you'll see is the forward slash ( / ). Java has no such delimiters: the pattern is passed as an ordinary string, as in text.split(" ").
The split(String regex) method splits the string around matches of the given regular expression. It behaves the same as calling split(String regex, int limit) with the given expression and a limit argument of zero, so trailing empty strings are not included in the resulting array.
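To make the limit rule concrete, here is a small sketch (the class and method names are made up for illustration). It shows that the default split drops trailing empty strings, that a negative limit keeps them, and that a delimiter matched at the very start still yields a leading empty string, which may be related to the empty element seen in the question.

```java
import java.util.Arrays;

public class SplitLimitDemo {
    // Default split: same as split(",", 0), trailing empty strings are dropped.
    static String[] defaultSplit(String s) {
        return s.split(",");
    }

    // Negative limit: the pattern is applied as often as possible and
    // trailing empty strings are kept.
    static String[] keepTrailing(String s) {
        return s.split(",", -1);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(defaultSplit("a,b,,")));  // [a, b]
        System.out.println(Arrays.toString(keepTrailing("a,b,,"))); // [a, b, , ]
        System.out.println(Arrays.toString(defaultSplit(",a")));    // [, a]
    }
}
```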
You can use the split() method of the JDK's String class to split a String on a delimiter, e.g. splitting a comma-separated String on a comma or a pipe-delimited String on a pipe (the pipe is a regex metacharacter, so it has to be escaped as "\\|").
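A short sketch of both cases, with hypothetical class and method names. Pattern.quote is used here as one way to treat the pipe literally; writing "\\|" by hand should work just as well.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class DelimiterSplitDemo {
    // A comma has no special meaning in a regex, so it can be used directly.
    static String[] splitOnComma(String s) {
        return s.split(",");
    }

    // A bare "|" means alternation in a regex, so quote it to match the literal character.
    static String[] splitOnPipe(String s) {
        return s.split(Pattern.quote("|"));
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOnComma("one,two,three"))); // [one, two, three]
        System.out.println(Arrays.toString(splitOnPipe("one|two|three")));  // [one, two, three]
    }
}
```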
Are you sure you want to use regex? There's a faster implementation for splitting on single characters: StringTokenizer. It can also return the delimiters.
import java.util.StringTokenizer;

String str = "word1, word2. word3? word4!";
String delim = ",.!?";
// The third argument (true) makes the tokenizer return the delimiters as tokens too.
StringTokenizer st = new StringTokenizer(str, delim, true);
while (st.hasMoreTokens()) {
    String token = st.nextToken();
    // ... token will be: "word1", ",", " word2", ".", etc.
}
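One caveat worth checking before adopting this for the text in the question: StringTokenizer treats every delimiter character the same way wherever it appears, so a symbol inside a word is split off too. A quick sketch to see this (TokenizerDemo and tokens are names invented for this example):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizerDemo {
    // Collects every token, including the delimiters themselves (returnDelims = true).
    static List<String> tokens(String text, String delims) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, delims, true);
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        // "wor.d1," comes back as "wor", ".", "d1", "," because the inner dot is a delimiter too.
        for (String t : tokens("wor.d1, :word2.", ".,!?:")) {
            System.out.println("[" + t + "]");
        }
    }
}
```

So this fits when punctuation should always be separated, but not when symbols inside a word have to stay in place.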
For simple separators I recommend the StringTokenizer. But here's a solution using regex and another auxiliary separator:
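import java.util.regex.Pattern;

String s = "one,two, three four , five";
// Wrap each run of commas/whitespace in '#' (assumes '#' never occurs in the input).
s = s.replaceAll("([,\\s]+)", "#$1#");
// Splitting on '#' now yields the words and the separators as separate elements.
Pattern p = Pattern.compile("#");
String[] result = p.split(s);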
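For the exact output asked for in the question, a possibly simpler route is to split on whitespace first and then peel the chosen symbols off each end of every word, leaving symbols inside a word alone. This is only a sketch under that reading of the requirement; BorderPunctuation, tokenize and SYMBOLS are names made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class BorderPunctuation {
    // Only these symbols are peeled off; anything else stays part of the word.
    private static final String SYMBOLS = ".,!?:";

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // only happens for blank input
            }
            int start = 0;
            int end = word.length();
            // Each leading symbol becomes its own token.
            while (start < end && SYMBOLS.indexOf(word.charAt(start)) >= 0) {
                tokens.add(String.valueOf(word.charAt(start)));
                start++;
            }
            // Find where the trailing run of symbols begins.
            while (end > start && SYMBOLS.indexOf(word.charAt(end - 1)) >= 0) {
                end--;
            }
            // The core of the word, with inner punctuation untouched.
            if (start < end) {
                tokens.add(word.substring(start, end));
            }
            // Each trailing symbol becomes its own token.
            for (int i = end; i < word.length(); i++) {
                tokens.add(String.valueOf(word.charAt(i)));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "wor.d1, :word2. wo,rd3? word4!";
        // prints: wor.d1 | , | : | word2 | . | wo,rd3 | ? | word4 | !
        System.out.println(String.join(" | ", tokenize(text)));
    }
}
```

Because nothing is split on a zero-width match, no empty strings show up at the start of a word.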