
Remove all spaces between Chinese words with regex

I would like to remove all spaces between Chinese characters only.

My text: "請 把 這 裡 的 10 多 個 字 合 併. Can you help me?"

Ideal output: "請把這裡的 10 多個字合併. Can you help me?"

var str = '請 把 這 裡 的 10 多 個 字 合 併. Can you help me?';
str = str.replace("/\ /", "");

I have studied a similar question for Python, but that approach does not seem to work in my situation, so I am asking here for help.

lewishole asked Jan 14 '19 09:01




1 Answer

Getting to the Chinese char matching pattern

Using the Unicode Tools, the \p{Han} Unicode property class, which matches any Chinese character, can be expanded into the following character class:

[\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DB5\u4E00-\u9FEF\uF900-\uFA6D\uFA70-\uFAD9\U00020000-\U0002A6D6\U0002A700-\U0002B734\U0002B740-\U0002B81D\U0002B820-\U0002CEA1\U0002CEB0-\U0002EBE0\U0002F800-\U0002FA1D] 

In ES6, a single Chinese character can be matched with the u flag and \u{...} code point escapes:

/[\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DB5\u4E00-\u9FEF\uF900-\uFA6D\uFA70-\uFAD9\u{20000}-\u{2A6D6}\u{2A700}-\u{2B734}\u{2B740}-\u{2B81D}\u{2B820}-\u{2CEA1}\u{2CEB0}-\u{2EBE0}\u{2F800}-\u{2FA1D}]/u 

Transpiling it to ES5 using the ES2015 Unicode regular expression transpiler, we get

(?:[\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DB5\u4E00-\u9FEF\uF900-\uFA6D\uFA70-\uFAD9]|[\uD840-\uD868\uD86A-\uD86C\uD86F-\uD872\uD874-\uD879][\uDC00-\uDFFF]|\uD869[\uDC00-\uDED6\uDF00-\uDFFF]|\uD86D[\uDC00-\uDF34\uDF40-\uDFFF]|\uD86E[\uDC00-\uDC1D\uDC20-\uDFFF]|\uD873[\uDC00-\uDEA1\uDEB0-\uDFFF]|\uD87A[\uDC00-\uDFE0]|\uD87E[\uDC00-\uDE1D]) 

a pattern that matches any Chinese character using a plain JS RegExp.
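The reason the ES5 pattern needs surrogate pair alternatives can be sketched as follows. This is only an illustration, assuming U+20000 as a sample Han character outside the Basic Multilingual Plane; the variable names are made up for the example.

```javascript
// U+20000 is a Han character outside the Basic Multilingual Plane (assumed sample).
const astral = '\u{20000}';

// In UTF-16 it occupies two code units: a high and a low surrogate.
const units = [astral.charCodeAt(0), astral.charCodeAt(1)].map(n => n.toString(16));
console.log(astral.length, units); // 2 [ 'd840', 'dc00' ]

// Without the u flag, the regex engine sees two separate code units,
// so the character has to be described as a surrogate pair.
const es5Match = /^[\uD840-\uD868][\uDC00-\uDFFF]$/.test(astral);

// With the u flag, the same character is a single code point.
const es6Match = /^[\u{20000}-\u{2A6D6}]$/u.test(astral);

console.log(es5Match, es6Match); // true true
```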

So, you may use

s.replace(/([\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DB5\u4E00-\u9FEF\uF900-\uFA6D\uFA70-\uFAD9]|[\uD840-\uD868\uD86A-\uD86C\uD86F-\uD872\uD874-\uD879][\uDC00-\uDFFF]|\uD869[\uDC00-\uDED6\uDF00-\uDFFF]|\uD86D[\uDC00-\uDF34\uDF40-\uDFFF]|\uD86E[\uDC00-\uDC1D\uDC20-\uDFFF]|\uD873[\uDC00-\uDEA1\uDEB0-\uDFFF]|\uD87A[\uDC00-\uDFE0]|\uD87E[\uDC00-\uDE1D])\s+(?=(?:[\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DB5\u4E00-\u9FEF\uF900-\uFA6D\uFA70-\uFAD9]|[\uD840-\uD868\uD86A-\uD86C\uD86F-\uD872\uD874-\uD879][\uDC00-\uDFFF]|\uD869[\uDC00-\uDED6\uDF00-\uDFFF]|\uD86D[\uDC00-\uDF34\uDF40-\uDFFF]|\uD86E[\uDC00-\uDC1D\uDC20-\uDFFF]|\uD873[\uDC00-\uDEA1\uDEB0-\uDFFF]|\uD87A[\uDC00-\uDFE0]|\uD87E[\uDC00-\uDE1D]))/g, '$1') 

See the regex demo.

If your JS environment is ECMAScript 2018 compliant, you may use a shorter pattern:

s.replace(/(\p{Script=Hani})\s+(?=\p{Script=Hani})/gu, '$1') 
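Whether the shorter pattern is usable can be checked at runtime. The sketch below is one possible approach, not part of the original answer: the function name is hypothetical, and it relies on the fact that an engine without Unicode property escapes rejects \p{...} under the u flag with a SyntaxError.

```javascript
// Hypothetical helper: reports whether this engine accepts \p{Script=Hani}.
function supportsUnicodePropertyEscapes() {
  try {
    // Built with the RegExp constructor so an unsupported pattern fails
    // here at runtime instead of breaking the parse of the whole script.
    new RegExp('\\p{Script=Hani}', 'u');
    return true;
  } catch (e) {
    return false;
  }
}

const canUseShortPattern = supportsUnicodePropertyEscapes();
console.log(canUseShortPattern);
```

If the check returns false, the longer ES5 pattern remains the option to use.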

Pattern details

  • (CHINESE_CHAR_PATTERN) - Capturing group 1 ($1 in the replacement pattern): any Chinese char
  • \s+ - any 1+ whitespaces (any Unicode whitespace)
  • (?=CHINESE_CHAR_PATTERN) - there must be a Chinese char immediately to the right of the current location.
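The lookahead in the last item matters because it leaves the right-hand character unconsumed, so that character can start the next match. A small comparison, assuming an ES2018 engine and using hypothetical function names, illustrates the difference:

```javascript
// Hypothetical helper names; both need an ES2018 engine for \p{Script=Hani}.

// Lookahead version: the right-hand Han character is only checked, not consumed.
function joinHanWithLookahead(text) {
  return text.replace(/(\p{Script=Hani})\s+(?=\p{Script=Hani})/gu, '$1');
}

// Consuming version: the right-hand Han character becomes part of the match,
// so it cannot serve as the left-hand side of the next match.
function joinHanConsuming(text) {
  return text.replace(/(\p{Script=Hani})\s+(\p{Script=Hani})/gu, '$1$2');
}

const sample = '請 把 這 裡';
console.log(joinHanWithLookahead(sample)); // 請把這裡
console.log(joinHanConsuming(sample));     // 請把 這裡
```

With single characters separated by spaces, the consuming version skips every second gap, which is why the pattern uses (?=...) instead of a second capturing group.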

JS demo:

var s = "請 把 這 裡 的 10 多 個 字 合 併. Can you help me?";
var HanChr = "[\\u2E80-\\u2E99\\u2E9B-\\u2EF3\\u2F00-\\u2FD5\\u3005\\u3007\\u3021-\\u3029\\u3038-\\u303B\\u3400-\\u4DB5\\u4E00-\\u9FEF\\uF900-\\uFA6D\\uFA70-\\uFAD9]|[\\uD840-\\uD868\\uD86A-\\uD86C\\uD86F-\\uD872\\uD874-\\uD879][\\uDC00-\\uDFFF]|\\uD869[\\uDC00-\\uDED6\\uDF00-\\uDFFF]|\\uD86D[\\uDC00-\\uDF34\\uDF40-\\uDFFF]|\\uD86E[\\uDC00-\\uDC1D\\uDC20-\\uDFFF]|\\uD873[\\uDC00-\\uDEA1\\uDEB0-\\uDFFF]|\\uD87A[\\uDC00-\\uDFE0]|\\uD87E[\\uDC00-\\uDE1D]";
console.log(s.replace(new RegExp('(' + HanChr + ')\\s+(?=(?:' + HanChr + '))', 'g'), '$1'));

A test for the regex compliant with the ECMAScript 2018 standard:

var s = "請 把 這 裡 的 10 多 個 字 合 併. Can you help me?";
console.log(s.replace(/(\p{Script=Hani})\s+(?=\p{Script=Hani})/gu, '$1'));

Wiktor Stribiżew answered Oct 06 '22 09:10