
General method to trim non-printable characters in Clojure

I encountered a bug where I couldn't match two seemingly 'identical' strings together. For example, the following two strings fail to match: "sample" and "​sample".

To replicate the issue, one can run the following in Clojure.

(= "sample" "​sample") ; returns false

After an hour of frustrated debugging, I discovered that there was a zero-width space at the front of the second string! Removing it from this particular example with a backspace is trivial. However, I have a database of strings that I'm matching, and several of them seem to have the same problem. My question is: is there a general method to trim zero-width spaces in Clojure?

Some methods I've tried:

(count (clojure.string/trim "​abc")) ; returns 4
(count (clojure.string/replace "​abc" #"\s" "")) ; returns 4

The thread "Remove zero-width space characters from a JavaScript string" does provide a regular-expression solution that works in this example, i.e.

(count (clojure.string/replace "​abc" #"[\u200B-\u200D\uFEFF]" "")) ; returns 3

However, as stated in the post itself, many other Unicode characters may be invisible. So I'm still interested in whether there's a more general method that doesn't rely on listing every possible invisible Unicode character.

Desmond Cheong asked Jul 15 '20 11:07


1 Answer

I believe what you are referring to are so-called non-printable characters. Based on this answer for Java, you could pass the regular expression #"\p{C}" as the pattern to clojure.string/replace:

(defn remove-non-printable-characters [x]
  (clojure.string/replace x #"\p{C}" ""))

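However, \p{C} also matches control characters such as the line break \n, so this version strips them too. To keep whitespace, intersect \p{C} with the non-whitespace class:

(defn remove-non-printable-characters [x]
  ;; [\p{C}&&[^\s]] matches characters that are in \p{C} but are not whitespace (\s)
  (clojure.string/replace x #"[\p{C}&&[^\s]]" ""))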

This function will remove non-printable characters. Let's test it:

(= "sample" "​sample")
;; => false

(= (remove-non-printable-characters "sample")
   (remove-non-printable-characters "​sample"))
;; => true

(remove-non-printable-characters "sam\nple")
;; => "sam\nple"
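When cleaning a whole database of strings, it may help to first see which invisible characters are actually present. The following is only a rough sketch; the helper name invisible-code-points is made up for illustration and is not part of clojure.string:

```clojure
(defn invisible-code-points
  "Lists every character of s matched by the regex class for Unicode
  category C, formatted as U+XXXX. Hypothetical helper for diagnosis."
  [s]
  (->> (re-seq #"\p{C}" s)
       ;; each match is a one-character string; format its code point as hex
       (map #(format "U+%04X" (.codePointAt ^String % 0)))))

(invisible-code-points "\u200Bsample\n")
;; => ("U+200B" "U+000A")
```

Note that the line break shows up as well, which is why the second version of remove-non-printable-characters excludes whitespace.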

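The \p{C} pattern matches the Unicode general category "Other" (C): control, format, surrogate, private-use, and unassigned code points.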
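If you want to check which category a particular character falls into, java.lang.Character/getType reports it. A minimal sketch, assuming you are on the JVM; general-category is a name I made up for this example:

```clojure
(defn general-category
  "Returns the java.lang.Character type constant for the code point cp.
  Hypothetical wrapper around Character/getType."
  [cp]
  (Character/getType (int cp)))

;; The zero-width space U+200B is a format character (Cf), part of \p{C}.
(== (general-category 0x200B) Character/FORMAT)
;; => true

;; A line break is a control character (Cc), also part of \p{C}.
(== (general-category (int \newline)) Character/CONTROL)
;; => true
```

This explains why the plain \p{C} version removes \n along with the zero-width space.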

Rulle answered Oct 11 '22 13:10