I am using nltk.word_tokenize on Dari text. The problem is that some single words are written with a space inside them. For example, the word "زنده گی", which means "life", and there are many other words like it. After every word part that ends with the character "ه" we have to add a space; otherwise the letters join, as in "زندهگی".
Can anyone help me, using regex or any other way, to avoid tokenizing a word as two tokens when its first part ends with "ه" and the next part begins with the character "گ"?
To resolve this problem, Persian has a character called the zero-width non-joiner (ZWNJ; نیم‌فاصله in Persian, also known as the half space or semi space). Two code points are used for it in practice: one is standard (U+200C) and the other is non-standard but widely used.
As far as I know, Dari is very similar to Persian. So first of all you should correct all words like زنده گی to زنده‌گی, converting every wrong space to a half space. Then you can simply use this regex to match all the words of a sentence:
[\u0600-\u06FF\uFB8A\u067E\u0686\u06AF\u200C\u200F]+
Online demo (the black bullet in the test string is a half space, which regex101 does not display properly, but if you check the match information and look at Match 5 you will see that it is matched correctly).
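For use outside regex101, here is a minimal Python sketch that applies the same character class with the standard `re` module. It assumes the text already contains half spaces (U+200C) where they belong; the names `WORD_RE` and `tokenize` and the sample sentence are only illustrative.

```python
import re

ZWNJ = "\u200c"  # zero-width non-joiner (half space)

# Character class copied from the regex above.
WORD_RE = re.compile(r"[\u0600-\u06FF\uFB8A\u067E\u0686\u06AF\u200C\u200F]+")

def tokenize(text):
    """Return every maximal run of characters matched by WORD_RE."""
    return WORD_RE.findall(text)

# Sample sentence (an assumption for illustration); the first word
# contains a half space instead of an ordinary space.
sentence = "زنده" + ZWNJ + "گی زیبا است"
tokens = tokenize(sentence)
print(tokens)
```

Because the half space is part of the character class, the first word stays together as one token, while the ordinary spaces still separate the other words.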
To convert the wrong spaces of a large text to half spaces, there is an add-on for Microsoft Word called Virastyar, which is free and open source. You can install it and clean up your whole text. Keep in mind, though, that this add-on was created for Persian and not for Dari. For example, in Persian we write زنده‌گی as زندگی, so it cannot correct this word for you. Other words, such as می شود, are easily corrected and converted to می‌شود. You can also add custom words to its database.
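If you only need the specific case from the question ("ه", then a space, then "گ"), a small regex substitution may be enough. This is a rough sketch, not a replacement for a proper normalizer; the names `HEH_SPACE_GAF` and `fix_half_spaces` are made up for the example.

```python
import re

ZWNJ = "\u200c"  # zero-width non-joiner (half space)

# One or more ordinary spaces that sit between "ه" (U+0647) and "گ" (U+06AF).
HEH_SPACE_GAF = re.compile("(?<=\u0647) +(?=\u06AF)")

def fix_half_spaces(text):
    """Replace the space(s) between a final "ه" and a following "گ" with a ZWNJ."""
    return HEH_SPACE_GAF.sub(ZWNJ, text)

sample = "زنده گی"
fixed = fix_half_spaces(sample)
print(repr(fixed))
```

Be aware that this rule also joins two genuinely separate words whenever the first ends with "ه" and the second starts with "گ", so you may want to restrict it to a list of known words.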