
Tokenization of Arabic words using NLTK


I'm using NLTK word_tokenizer to split a sentence into words.

I want to tokenize this sentence:

في_بيتنا كل شي لما تحتاجه يضيع ...ادور على شاحن فجأة يختفي ..لدرجة اني اسوي نفسي ادور شيء  

The code I'm writing is:

import re
import nltk

lex = u"في_بيتنا كل شي لما تحتاجه يضيع ...ادور على شاحن فجأة يختفي ..لدرجة اني اسوي نفسي ادور شيء"

wordsArray = nltk.word_tokenize(lex)
print " ".join(wordsArray)

The problem is that the word_tokenize function doesn't split by words. Instead, it splits by letters so that the output is:

"ف ي _ ب ي ت ن ا ك ل ش ي ل م ا ت ح ت ا ج ه ي ض ي ع ... ا د و ر ع ل ى ش ا ح ن ف ج أ ة ي خ ت ف ي .. ل د ر ج ة ا ن ي ا س و ي ن ف س ي ا د و ر ش ي ء" 

Any ideas?

What I've reached so far:

Trying the text in the online demo here, it also appeared to be tokenized by letters; other tokenizers, however, tokenized it correctly. Does that mean that word_tokenize is for English only? Does that go for most NLTK functions?

Asked Oct 23 '12 by Hady Elsahar


People also ask

How do you Tokenize in nltk?

NLTK provides a tokenize module with two main helpers: word tokenization, where the word_tokenize() method splits a sentence into tokens or words, and sentence tokenization, where the sent_tokenize() method splits a document or paragraph into sentences.
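For illustration, a minimal sketch of those two helpers on an English example of my own (both rely on the pre-trained Punkt data, downloaded once with nltk.download):

import nltk

nltk.download('punkt')  # pre-trained Punkt models used by both helpers

text = "NLTK is a leading platform. It works with human language data."

# split into sentences
print(nltk.sent_tokenize(text))
# ['NLTK is a leading platform.', 'It works with human language data.']

# split into words and punctuation tokens
print(nltk.word_tokenize(text))
# ['NLTK', 'is', 'a', 'leading', 'platform', '.', 'It', 'works', ...]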

What does nltk Punkt do?

The nltk.tokenize.punkt module provides a tokenizer that divides a text into a list of sentences, using an unsupervised algorithm to build a model for abbreviations, collocations, and words that start sentences.
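The Punkt class can also be used directly; a small sketch with its default, untrained parameters and a sentence of my own (the pre-trained English model that sent_tokenize loads handles abbreviations far better):

from nltk.tokenize.punkt import PunktSentenceTokenizer

# untrained tokenizer with Punkt's default parameters
tokenizer = PunktSentenceTokenizer()
print(tokenizer.tokenize("The algorithm is unsupervised. It learns abbreviations from raw text."))
# ['The algorithm is unsupervised.', 'It learns abbreviations from raw text.']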

How do you do tokenization in NLP?

Tokenization is the process of splitting a string or text into a list of tokens. A token is a part of a larger unit: a word is a token in a sentence, and a sentence is a token in a paragraph. How does sent_tokenize work? The sent_tokenize function uses an instance of PunktSentenceTokenizer from the nltk.tokenize.punkt module.
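This is roughly what sent_tokenize does internally in older NLTK releases (the resource path below is an assumption and may differ between versions):

import nltk.data

# load the pre-trained PunktSentenceTokenizer and delegate to its tokenize() method
punkt = nltk.data.load('tokenizers/punkt/english.pickle')
print(punkt.tokenize("First sentence. Second sentence."))
# ['First sentence.', 'Second sentence.']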


2 Answers

I always recommend using nltk.tokenize.wordpunct_tokenize. You can try out many of the NLTK tokenizers at http://text-processing.com/demo/tokenize/ and see for yourself.
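A quick sketch of that suggestion on the asker's sentence (truncated), run under Python 3; wordpunct_tokenize is regex-based, so it does not depend on an English-trained model:

from nltk.tokenize import wordpunct_tokenize

text = u"في_بيتنا كل شي لما تحتاجه يضيع ...ادور على شاحن فجأة يختفي"
# splits on runs of word characters and runs of punctuation,
# so the Arabic words come out as whole tokens rather than single letters
print(wordpunct_tokenize(text))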

Answered Sep 20 '22 by Jacob


This is the output I get with my code. As I recall, Unicode doesn't play well with this in Python 2; I used Python 3.5:

nltk.word_tokenize('في_بيتنا كل شي لما تحتاجه يضيع ...ادور على شاحن فجأة يختفي ..لدرجة اني اسوي نفسي ادور شيء ') 

['في_بيتنا', 'كل', 'شي', 'لما', 'تحتاجه', 'يضيع', '...', 'ادور', 'على', 'شاحن', 'فجأة', 'يختفي', '..لدرجة', 'اني', 'اسوي', 'نفسي', 'ادور', 'شيء']

Answered Sep 20 '22 by Pradi KL