I am trying to use spaCy's Japanese tokenizer.
import spacy
Question= 'すぺいんへ いきました。'
nlp(Question.decode('utf8'))
I am getting the following error:
TypeError: Expected unicode, got spacy.tokens.token.Token
Any ideas on how to fix this?
Thanks!
However, in Japanese, words are normally written without any spaces between them. Japanese tokenization therefore requires reading and analyzing the whole sentence, recognizing words, and determining word boundaries without any explicit delimiters. Most Japanese tokenizers use lattice-based tokenization.
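For intuition, here is a toy sketch of the lattice idea: every dictionary word that matches at each position becomes an edge in a lattice, and dynamic programming picks the cheapest path through it. The tiny dictionary and costs below are invented purely for illustration; real tokenizers such as MeCab or Janome use large dictionaries with learned costs.
# Toy lattice-based tokenizer, for illustration only.
# The dictionary and costs are made up; real tokenizers (MeCab, Janome, Sudachi)
# use large dictionaries with learned connection costs.
DICT = {
    "すぺいん": 1.0,  # Spain (hiragana)
    "へ": 0.5,        # particle
    "いき": 1.0,
    "まし": 0.8,
    "た": 0.5,
    "。": 0.1,
}
UNKNOWN_COST = 10.0   # heavy penalty for falling back to a single unknown character

def tokenize(text):
    n = len(text)
    best = [float("inf")] * (n + 1)   # best[i]: cheapest cost to segment text[:i]
    back = [None] * (n + 1)           # back[i]: (start, word) of the last word on that path
    best[0] = 0.0
    for i in range(n):
        if best[i] == float("inf"):
            continue
        # every dictionary word starting at position i is an edge in the lattice
        candidates = [(w, c) for w, c in DICT.items() if text.startswith(w, i)]
        # fall back to one unknown character so the lattice never gets stuck
        candidates.append((text[i], UNKNOWN_COST))
        for word, cost in candidates:
            j = i + len(word)
            if best[i] + cost < best[j]:
                best[j] = best[i] + cost
                back[j] = (i, word)
    # follow the back pointers to recover the best segmentation
    words, i = [], n
    while i > 0:
        i, word = back[i]
        words.append(word)
    return list(reversed(words))

print(tokenize("すぺいんへいきました。"))
## <OUTPUT>
## ['すぺいん', 'へ', 'いき', 'まし', 'た', '。']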
In spaCy, the process of tokenizing a text into segments of words and punctuation is done in several steps, processing the text from left to right. First, the tokenizer splits the text on whitespace, similar to the split() function. Then, for each substring, it checks whether the substring matches a tokenizer exception rule or whether a prefix or suffix can be split off.
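You can see those steps on an English example. This sketch assumes spaCy v2.2 or later, where Tokenizer.explain is available:
import spacy

nlp = spacy.blank("en")                  # blank pipeline: tokenizer only
text = "Let's go to N.Y.!"
print(text.split())                      # plain whitespace split
print([t.text for t in nlp(text)])       # spaCy's rule-based tokenizer
# Tokenizer.explain reports which rule (prefix, suffix, special case, ...) produced each token
for rule, token_text in nlp.tokenizer.explain(text):
    print(token_text, rule)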
While NLTK gives you access to many alternative algorithms for each task, spaCy aims to provide one well-tuned implementation. Its developers describe it as offering the fastest and most accurate syntactic analysis of any NLP library released to date, and it also provides access to larger word vectors that are easier to customize.
According to spaCy, tokenization for Japanese is still in an alpha phase. The ideal way to tokenize is to provide a tokenized word list together with information about the language structure. For example, for an English sentence you can try this:
import spacy
nlp = spacy.load("en") # execute "python -m spacy download en" before this on standard console
sentence = "Writing some answer on stackoverflow, as an example for spacy language model"
print(["::".join((word.orth_, word.pos_)) for word in nlp(sentence)])
## <OUTPUT>
## ['Writing::VERB', 'some::DET', 'answer::NOUN', 'on::ADP', 'stackoverflow::NOUN', ',::PUNCT', 'as::ADP', 'an::DET', 'example::NOUN', 'for::ADP', 'spacy::ADJ', 'language::NOUN', 'model::NOUN']
Such results are not yet available for Japanese.
If you run python -m spacy download xx and then use nlp = spacy.load("xx"), it tries its best to recognize named entities.
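For example (assuming the multi-language model has been downloaded as described above):
import spacy

nlp = spacy.load("xx")    # multi-language model; it mainly provides named entity recognition
doc = nlp("Google was founded in California by Larry Page and Sergey Brin.")
print([(ent.text, ent.label_) for ent in doc.ents])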
Also, if you look at the spaCy source here, you will see that tokenization is available, but it only exposes a make_doc function, which is quite naive.
Note: the pip version of spaCy still contains older code; the GitHub link above is somewhat closer to the latest code.
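To see what make_doc gives you in practice, you can call it directly; it only segments the text and attaches no tags, parses or entities. This sketch assumes the Japanese language data and its tokenizer backend (janome in older versions, SudachiPy in newer ones) are installed:
import spacy

nlp = spacy.blank("ja")    # Japanese language data only, no trained model
doc = nlp.make_doc("すぺいんへ いきました。")
print([t.text for t in doc])   # just the segmented tokens, nothing else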
So for building a tokenization right now, it is highly suggested to use janome. An example is given below:
from janome.tokenizer import Tokenizer as janome_tokenizer
sentence = "日本人のものと見られる、延べ2億件のメールアドレスとパスワードが闇サイトで販売されていたことがわかりました。過去に漏えいしたデータを集めたものと見られ、調査に当たったセキュリティー企業は、日本を狙ったサイバー攻撃のきっかけになるおそれがあるとして注意を呼びかけています。"
token_object = janome_tokenizer()
[x.surface for x in token_object.tokenize(sentence)]
## <OUTPUT> ##
## ['日本人', 'の', 'もの', 'と', '見', 'られる', '、', '延べ', '2', '億', '件', 'の', 'メールアドレス', 'と', 'パスワード', 'が', '闇', 'サイト', 'で', '販売', 'さ', 'れ', 'て', 'い', 'た', 'こと', 'が', 'わかり', 'まし', 'た', '。', '過去', 'に', '漏えい', 'し', 'た', 'データ', 'を', '集め', 'た', 'もの', 'と', '見', 'られ', '、', '調査', 'に', '当たっ', 'た', 'セキュリティー', '企業', 'は', '、', '日本', 'を', '狙っ', 'た', 'サイバー', '攻撃', 'の', 'きっかけ', 'に', 'なる', 'お', 'それ', 'が', 'ある', 'として', '注意', 'を', '呼びかけ', 'て', 'い', 'ます', '。']
## you can also inspect each token's full info with:
## for x in token_object.tokenize(sentence):
## print(x)
## <OUTPUT LIKE>:
## 日本人 名詞,一般,*,*,*,*,日本人,ニッポンジン,ニッポンジン
## の 助詞,連体化,*,*,*,*,の,ノ,ノ
## もの 名詞,非自立,一般,*,*,*,もの,モノ,モノ
## と 助詞,格助詞,引用,*,*,*,と,ト,ト
## ....
## <OUTPUT Truncated>
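If you want to get these pre-tokenized words back into spaCy, one option is to build a Doc from the word list yourself. A minimal sketch (spaces=False for every token, since Japanese writes no whitespace between words):
from janome.tokenizer import Tokenizer as janome_tokenizer
from spacy.tokens import Doc
from spacy.vocab import Vocab

sentence = "すぺいんへいきました。"
words = [x.surface for x in janome_tokenizer().tokenize(sentence)]
# build a spaCy Doc directly from the pre-tokenized words
doc = Doc(Vocab(), words=words, spaces=[False] * len(words))
print([t.text for t in doc])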
I think the spaCy team is working on similar output in order to build models for Japanese, so that "language specific" constructs become available for Japanese as they are for other languages.
Update
After writing this, out of curiosity I started searching around. Please check UDPipe here, here & here. It seems UDPipe supports more than 50 languages, and it offers a solution to the language-support problem we see in spaCy.
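As a rough sketch, the official Python bindings are in the ufal.udpipe package; the model file name below is just a placeholder for whichever pretrained Japanese model you download from the UDPipe site:
from ufal.udpipe import Model, Pipeline

# placeholder file name: download a pretrained Japanese model from the UDPipe site first
model = Model.load("japanese-ud.udpipe")
pipeline = Pipeline(model, "tokenize", Pipeline.DEFAULT, Pipeline.DEFAULT, "conllu")
print(pipeline.process("すぺいんへ いきました。"))   # CoNLL-U output, one token per line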
Try using this:
import spacy

nlp = spacy.blank("ja")       # the nlp pipeline must be created before it can be called
question = u'すぺいんへ いきました。'
nlp(question)