Split String using Unicode delimiter

I need to split a string with "-" as the delimiter in Java. Example: "Single Room - Enjoy your stay"

The same data comes in English and German depending on the locale, so I cannot use the usual string.split("-"). The Unicode code point for the dash character in my data is 8212 (decimal) or 0x2014 (hex). How do I split the string on that Unicode character?

Bhavya asked Mar 08 '12 04:03


1 Answer

You may be mistaken in which Unicode dash character you’re getting. As of Unicode v6.1, there are 27 code points that have the \p{Dash} property:

U+002D  -  HYPHEN-MINUS
U+058A  ֊  ARMENIAN HYPHEN
U+05BE  ־  HEBREW PUNCTUATION MAQAF
U+1400    CANADIAN SYLLABICS HYPHEN
U+1806    MONGOLIAN TODO SOFT HYPHEN
U+2010    HYPHEN
U+2011    NON-BREAKING HYPHEN
U+2012    FIGURE DASH
U+2013    EN DASH
U+2014    EM DASH
U+2015    HORIZONTAL BAR
U+2053    SWUNG DASH
U+207B    SUPERSCRIPT MINUS
U+208B    SUBSCRIPT MINUS
U+2212    MINUS SIGN
U+2E17    DOUBLE OBLIQUE HYPHEN
U+2E1A    HYPHEN WITH DIAERESIS
U+2E3A    TWO-EM DASH
U+2E3B    THREE-EM DASH
U+301C   WAVE DASH
U+3030   WAVY DASH
U+30A0   KATAKANA-HIRAGANA DOUBLE HYPHEN
U+FE31   PRESENTATION FORM FOR VERTICAL EM DASH
U+FE32   PRESENTATION FORM FOR VERTICAL EN DASH
U+FE58   SMALL EM DASH
U+FE63   SMALL HYPHEN-MINUS
U+FF0D   FULLWIDTH HYPHEN-MINUS

In Perl or ICU, you could just split directly on \p{Dash}, but since the Sun Pattern class (java.util.regex.Pattern) doesn't support full Unicode properties like that, you have to synthesize it with an enumerated square-bracketed character class. So splitting on the pattern:

string.split("[\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2053\u207B\u208B\u2212\u2E17\u2E1A\u2E3A\u2E3B\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]")

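should do the trick for you. You can also double the backslashes (for example \\u2014 instead of \u2014) if you fear the Java compiler getting in your way, since it translates Unicode escapes in source code before it does anything else. The regex parser understands the \uXXXX notation on its own, so the doubled form matches the same characters.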
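To make that concrete, here is a minimal sketch of the split as a small helper. The class name DashSplit, the method splitOnDash, and the sample strings are made up for illustration. The optional whitespace (\s*) around the dash is an assumption added so the pieces come back trimmed. The sketch uses doubled backslashes, and it lists U+2E3A, U+2E3B and U+301C individually rather than as a range, so that only the 27 code points listed above match.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class DashSplit {
    // Enumerated character class covering the 27 Dash code points listed above.
    // The backslashes are doubled, so the regex engine (not the compiler)
    // decodes the escapes. The surrounding \s* is an assumption of this sketch:
    // it swallows whitespace on both sides of the dash.
    private static final Pattern DASH = Pattern.compile(
        "\\s*[\\u002D\\u058A\\u05BE\\u1400\\u1806\\u2010-\\u2015\\u2053\\u207B"
        + "\\u208B\\u2212\\u2E17\\u2E1A\\u2E3A\\u2E3B\\u301C\\u3030\\u30A0"
        + "\\uFE31\\uFE32\\uFE58\\uFE63\\uFF0D]\\s*");

    // Hypothetical helper: splits text on any of the dash characters.
    static String[] splitOnDash(String text) {
        return DASH.split(text);
    }

    public static void main(String[] args) {
        // Sample inputs: one with EM DASH (U+2014), one with HYPHEN-MINUS (U+002D).
        String withEmDash = "Single Room \u2014 Enjoy your stay";
        String withHyphen = "Single Room - Enjoy your stay";

        System.out.println(Arrays.toString(splitOnDash(withEmDash)));
        System.out.println(Arrays.toString(splitOnDash(withHyphen)));
    }
}
```

Because U+002D is in the class, a hyphenated word such as "non-smoking" would be split as well. If the data only ever uses the em dash as a separator, splitting on "\\s*\\u2014\\s*" alone may be the safer choice.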

tchrist answered Oct 05 '22 15:10