Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Spilt String using Unicode delimiter

I need to split a string with "-" as delimiter in java. Ex: "Single Room - Enjoy your stay"

I have the same data coming in english and german depending on locale . Hence I cannot use the usual string.split("-") . The unicode for "-" character is 8212(dec) or x2014(hex).How do I split the string using unicode ???

like image 935
Bhavya Avatar asked Mar 08 '12 04:03

Bhavya


People also ask

How do you split a string with a delimiter?

Using String. split() Method. The split() method of the String class is used to split a string into an array of String objects based on the specified delimiter that matches the regular expression.

How do you split a Unicode string in Python?

Practical Data Science using Python Unicode string can be split, and byte offset can be specified using the 'unicode_split' method and the 'unicode_decode_with_offsets'methods respectively. These methods are present in the 'string' class of 'tensorflow' module.

What is Unicode string example?

Encodings. To summarize the previous section: a Unicode string is a sequence of code points, which are numbers from 0 through 0x10FFFF (1,114,111 decimal). This sequence of code points needs to be represented in memory as a set of code units, and code units are then mapped to 8-bit bytes.

What split str?

Definition and Usage The split() method splits a string into an array of substrings. The split() method returns the new array. The split() method does not change the original string. If (" ") is used as separator, the string is split between words.


1 Answers

You may be mistaken in which Unicode dash character you’re getting. As of Unicode v6.1, there are 27 code points that have the \p{Dash} property:

U+002D ‭ -  HYPHEN-MINUS
U+058A ‭ ֊  ARMENIAN HYPHEN
U+05BE ‭ ־  HEBREW PUNCTUATION MAQAF
U+1400 ‭ ᐀  CANADIAN SYLLABICS HYPHEN
U+1806 ‭ ᠆  MONGOLIAN TODO SOFT HYPHEN
U+2010 ‭ ‐  HYPHEN
U+2011 ‭ ‑  NON-BREAKING HYPHEN
U+2012 ‭ ‒  FIGURE DASH
U+2013 ‭ –  EN DASH
U+2014 ‭ —  EM DASH
U+2015 ‭ ―  HORIZONTAL BAR
U+2053 ‭ ⁓  SWUNG DASH
U+207B ‭ ⁻  SUPERSCRIPT MINUS
U+208B ‭ ₋  SUBSCRIPT MINUS
U+2212 ‭ −  MINUS SIGN
U+2E17 ‭ ⸗  DOUBLE OBLIQUE HYPHEN
U+2E1A ‭ ⸚  HYPHEN WITH DIAERESIS
U+2E3A ‭ ⸺  TWO-EM DASH
U+2E3B ‭ ⸻  THREE-EM DASH
U+301C ‭ 〜 WAVE DASH
U+3030 ‭ 〰 WAVY DASH
U+30A0 ‭ ゠ KATAKANA-HIRAGANA DOUBLE HYPHEN
U+FE31 ‭ ︱ PRESENTATION FORM FOR VERTICAL EM DASH
U+FE32 ‭ ︲ PRESENTATION FORM FOR VERTICAL EN DASH
U+FE58 ‭ ﹘ SMALL EM DASH
U+FE63 ‭ ﹣ SMALL HYPHEN-MINUS
U+FF0D ‭ - FULLWIDTH HYPHEN-MINUS

In Perl or ICU, you could just split directly on \p{dash}, but since the Sun Pattern class doesn’t support full Unicode properties like that, you have to synthesize it with an enumerated square-bracketed character class. So splitting on the pattern:

string.split("[\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2053\u207B\u208B\u2212\u2E17\u2E1A\u2E3A-\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]")

should do the trick for you. You can actually double-backslash those if you fear for the Java preprocessor getting in your way, because the regex parser should know to understand the alternate notation.

like image 127
tchrist Avatar answered Oct 05 '22 15:10

tchrist