I need to split a string in Java using "-" as the delimiter. Example: "Single Room - Enjoy your stay"
The same data arrives in English or German depending on the locale, so I cannot use the usual string.split("-"). The Unicode code point for the "-" character I receive is 8212 (decimal) or 0x2014 (hex). How do I split the string using the Unicode value?
Using the String.split() method: the split() method of the String class splits a string into an array of String objects around matches of the given regular expression.
You may be mistaken in which Unicode dash character you’re getting. As of Unicode v6.1, there are 27 code points that have the \p{Dash} property:
U+002D - HYPHEN-MINUS
U+058A ֊ ARMENIAN HYPHEN
U+05BE ־ HEBREW PUNCTUATION MAQAF
U+1400 ᐀ CANADIAN SYLLABICS HYPHEN
U+1806 ᠆ MONGOLIAN TODO SOFT HYPHEN
U+2010 ‐ HYPHEN
U+2011 ‑ NON-BREAKING HYPHEN
U+2012 ‒ FIGURE DASH
U+2013 – EN DASH
U+2014 — EM DASH
U+2015 ― HORIZONTAL BAR
U+2053 ⁓ SWUNG DASH
U+207B ⁻ SUPERSCRIPT MINUS
U+208B ₋ SUBSCRIPT MINUS
U+2212 − MINUS SIGN
U+2E17 ⸗ DOUBLE OBLIQUE HYPHEN
U+2E1A ⸚ HYPHEN WITH DIAERESIS
U+2E3A ⸺ TWO-EM DASH
U+2E3B ⸻ THREE-EM DASH
U+301C 〜 WAVE DASH
U+3030 〰 WAVY DASH
U+30A0 ゠ KATAKANA-HIRAGANA DOUBLE HYPHEN
U+FE31 ︱ PRESENTATION FORM FOR VERTICAL EM DASH
U+FE32 ︲ PRESENTATION FORM FOR VERTICAL EN DASH
U+FE58 ﹘ SMALL EM DASH
U+FE63 ﹣ SMALL HYPHEN-MINUS
U+FF0D － FULLWIDTH HYPHEN-MINUS
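To find out which of these characters your data actually contains, it can help to print the code points of the punctuation in a sample string. The following is only a diagnostic sketch: the class name DashInspector and the method describePunctuation are made up for illustration, and it assumes Java 7 or later, where Character.getName is available.

```java
import java.util.ArrayList;
import java.util.List;

public class DashInspector {

    /**
     * Returns one "U+XXXX NAME" entry for every code point in the text
     * that is neither a letter, a digit, nor whitespace.
     * (Hypothetical helper, not part of any library.)
     */
    public static List<String> describePunctuation(String text) {
        List<String> found = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!Character.isLetterOrDigit(cp) && !Character.isWhitespace(cp)) {
                found.add(String.format("U+%04X %s", cp, Character.getName(cp)));
            }
            // Advance by one or two chars depending on the code point.
            i += Character.charCount(cp);
        }
        return found;
    }

    public static void main(String[] args) {
        // Sample value with an em dash written as a Unicode escape.
        String sample = "Single Room \u2014 Enjoy your stay";
        for (String line : describePunctuation(sample)) {
            System.out.println(line); // prints: U+2014 EM DASH
        }
    }
}
```

Running this against a real value from each locale should show whether you are dealing with one dash character or several.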
In Perl or ICU, you could just split directly on \p{dash}, but since the Sun Pattern class doesn’t support full Unicode properties like that, you have to synthesize it with an enumerated square-bracketed character class. So splitting on the pattern:
string.split("[\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2053\u207B\u208B\u2212\u2E17\u2E1A\u2E3A\u2E3B\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]")
should do the trick for you. The Java compiler translates \uXXXX escapes in the source before the regex engine ever sees the string. If you would rather not rely on that, you can double the backslashes (for example \\u2014), because the Pattern class understands the \uXXXX notation on its own.
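As a minimal sketch of the approach described above, the character class can be compiled once and reused. The class name DashSplitter and the method splitOnDash are hypothetical, and stripping the whitespace around the dash is an extra assumption about what you want, not part of the pattern given above. The backslashes are doubled so that the regex engine, rather than the compiler, interprets the escapes.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class DashSplitter {

    // Enumerated class of the 27 Dash code points, with optional
    // whitespace on either side so the resulting parts come out trimmed.
    private static final Pattern DASH = Pattern.compile(
        "\\s*[\\u002D\\u058A\\u05BE\\u1400\\u1806\\u2010-\\u2015\\u2053\\u207B\\u208B"
      + "\\u2212\\u2E17\\u2E1A\\u2E3A\\u2E3B\\u301C\\u3030\\u30A0\\uFE31\\uFE32\\uFE58"
      + "\\uFE63\\uFF0D]\\s*");

    /** Splits the text at every dash character (hypothetical helper). */
    public static String[] splitOnDash(String text) {
        return DASH.split(text.trim());
    }

    public static void main(String[] args) {
        // The em dash is written as an escape to keep the source ASCII-only.
        String[] parts = splitOnDash("Single Room \u2014 Enjoy your stay");
        System.out.println(Arrays.toString(parts)); // [Single Room, Enjoy your stay]
    }
}
```

Note that this also splits hyphenated words such as "Non-Smoking". If that is a concern for your data, one option is to change both `\\s*` to `\\s+` so that only dashes surrounded by whitespace act as delimiters.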