Handling \u200b (Zero width space) character in text preprocessing for NLP task

Question

I'm preprocessing some text for a NER model I'm training, and I'm encountering this character quite a lot. This character is not removed with strip():

>>> 'Hello world!\u200b'.strip()
'Hello world!\u200b'

It is not treated as whitespace by regular expressions:

>>> import re
>>> re.sub(r'\s+', ' ', "hello\u200bworld!")
'hello\u200bworld!'

and spaCy's tokenizer does not split tokens upon it:

>>> [t.text for t in nlp("hello\u200bworld!")]
['hello\u200bworld', '!']

So, how should I handle it? I could simply replace it, but I don't want to special-case this one character. I would rather replace all characters with similar characteristics.

Thanks.

ETL · Accepted Answer

How about simply doing a string replace before running the NLP pipeline?

'Hello world!\u200b'.replace('\u200b', ' ').strip()
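The question also asks about characters "with similar characteristics". One possible generalization, offered as a sketch rather than as part of the original answer, is to use the Unicode general category: U+200B belongs to category Cf (format characters), which `unicodedata.category` reports. The helper name `replace_format_chars` and its `replacement` parameter are hypothetical. Note the assumption that every Cf character is safe to replace: that category also contains the zero width joiner and non-joiner, the soft hyphen, and directional marks, which can be meaningful in some scripts and in emoji sequences.

```python
import re
import unicodedata


def replace_format_chars(text, replacement=" "):
    """Replace every Unicode 'Cf' (format) character, then collapse whitespace.

    Hypothetical helper: assumes all Cf characters are noise for the task.
    """
    cleaned = "".join(
        replacement if unicodedata.category(ch) == "Cf" else ch
        for ch in text
    )
    # Collapse runs of whitespace introduced by the replacement and trim the ends.
    return re.sub(r"\s+", " ", cleaned).strip()


print(replace_format_chars("hello\u200bworld!"))  # hello world!
```

Passing `replacement=""` deletes the characters instead of turning them into token boundaries; which is appropriate depends on whether the zero width space separated words in the source text.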

Tags:

python

removing-whitespace

nlp

spacy

Gino

1 Answer

ETL
