
How to handle words that have a space between their characters?

I am using nltk.word_tokenize on Dari text. The problem is that some single words are written with a space inside them.
For example, the word "زنده گی", which means "life", and there are many other words like it. Whenever a part of a word ends with the character "ه", we have to put a space after it; otherwise the letters join, as in "زندهگی".

Can anyone help me, using regex or any other way, so that a word is not split when one part of it ends with "ه" and the next part starts with the "گ" character?

The Afghan asked Sep 20 '17 09:09




1 Answer

To solve this problem, Persian uses a character called the zero-width non-joiner (نیم‌فاصله in Persian, also known as a half space or semi-space). Two code points are used for it in practice: the first is the standard one, and the second is not standard for this purpose but is widely used:

  1. \u200C : http://en.wikipedia.org/wiki/Zero-width_non-joiner
  2. \u200F : Right-to-left mark (http://unicode-table.com/en/#200F)

As far as I know, Dari is very similar to Persian. So first of all you should correct words like زنده گی to زنده‌گی, converting every wrong space to a half space. Then you can simply use this regex to match all the words of a sentence:

[\u0600-\u06FF\uFB8A\u067E\u0686\u06AF\u200C\u200F]+
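As a minimal sketch of how this pattern could be used from Python, the standard `re` module can stand in for the tokenizer. The function name `tokenize` and the sample sentence are only illustrative assumptions; the character class is the one given above.

```python
import re

# Character class from the answer: the Arabic block plus U+200C (ZWNJ) and U+200F.
WORD_RE = re.compile(r"[\u0600-\u06FF\uFB8A\u067E\u0686\u06AF\u200C\u200F]+")


def tokenize(text):
    """Return every maximal run of characters matched by WORD_RE."""
    return WORD_RE.findall(text)


# The first word is written with a half space (U+200C), so it stays one token.
joined = tokenize("زنده\u200cگی زیبا است")

# The same word written with an ordinary space is split into two tokens.
split = tokenize("زنده گی")

print(len(joined), len(split))
```

Because a half space is part of the character class and an ordinary space is not, the normalization step has to run before tokenizing.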

Online demo (the black bullet in the test string is a half space, which regex101 does not display as such; if you check the match information and look at Match 5, you will see that it is matched correctly)

To convert the wrong spaces in a large text to half spaces, there is a free and open-source add-in for Microsoft Word called Virastyar. You can install it and clean up your whole text. Bear in mind that this add-in was created for Persian, not Dari. For example, in Persian زنده‌گی is written as زندگی, so it cannot correct this word for you. Other words, such as می شود, are corrected easily and converted to می‌شود. You can also add custom words to its database.
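If you would rather do the conversion in code for the specific case in the question (a part ending in "ه" followed by a part starting with "گ"), something like the following might work. This is only a sketch: the names `HEH_GAF_GAP` and `join_heh_gaf` are made up for illustration, and the rule is an assumption about how the text is written, not a complete Dari normalizer.

```python
import re

ZWNJ = "\u200c"  # zero-width non-joiner (half space)

# Spaces or tabs sitting between "ه" (U+0647) and a following "گ" (U+06AF).
# The lookbehind and lookahead keep both letters and match only the gap.
HEH_GAF_GAP = re.compile(r"(?<=\u0647)[ \t]+(?=\u06AF)")


def join_heh_gaf(text):
    """Replace the gap between 'ه' and 'گ' with a half space."""
    return HEH_GAF_GAP.sub(ZWNJ, text)


fixed = join_heh_gaf("زنده گی")
print(fixed.split())  # str.split() does not treat U+200C as whitespace
```

Note that this rule also joins two genuinely separate words whenever the first ends in "ه" and the next begins with "گ", so it is probably worth restricting it to a list of known suffixes or words before running it on real text.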

Mohsen answered Oct 15 '22 14:10
