I am trying to use spaCy's Japanese tokenizer.
import spacy
Question= 'すぺいんへ いきました。'
nlp(Question.decode('utf8'))
I am getting the following error:
TypeError: Expected unicode, got spacy.tokens.token.Token
Any ideas on how to fix this?
Thanks!
However, in Japanese, words are normally written without any spaces between them. Japanese tokenization therefore requires reading and analyzing the whole sentence, recognizing words, and determining word boundaries without any explicit delimiters. Most Japanese tokenizers use lattice-based tokenization.
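For intuition, here is a toy sketch of the lattice idea: every dictionary word that matches at each position becomes an edge in a lattice, and dynamic programming picks the cheapest path through it. The tiny dictionary and costs below are invented purely for illustration; real tokenizers such as MeCab or Janome use large dictionaries with learned costs.
# Toy lattice-based tokenizer, for illustration only.
# The dictionary and costs are made up; real tokenizers (MeCab, Janome, Sudachi)
# use large dictionaries with learned connection costs.
DICT = {
    "すぺいん": 1.0,  # Spain (hiragana)
    "へ": 0.5,        # particle
    "いき": 1.0,
    "まし": 0.8,
    "た": 0.5,
    "。": 0.1,
}
UNKNOWN_COST = 10.0   # heavy penalty for falling back to a single unknown character

def tokenize(text):
    n = len(text)
    best = [float("inf")] * (n + 1)   # best[i]: cheapest cost to segment text[:i]
    back = [None] * (n + 1)           # back[i]: (start, word) of the last word on that path
    best[0] = 0.0
    for i in range(n):
        if best[i] == float("inf"):
            continue
        # every dictionary word starting at position i is an edge in the lattice
        candidates = [(w, c) for w, c in DICT.items() if text.startswith(w, i)]
        # fall back to one unknown character so the lattice never gets stuck
        candidates.append((text[i], UNKNOWN_COST))
        for word, cost in candidates:
            j = i + len(word)
            if best[i] + cost < best[j]:
                best[j] = best[i] + cost
                back[j] = (i, word)
    # follow the back pointers to recover the best segmentation
    words, i = [], n
    while i > 0:
        i, word = back[i]
        words.append(word)
    return list(reversed(words))

print(tokenize("すぺいんへいきました。"))
## <OUTPUT>
## ['すぺいん', 'へ', 'いき', 'まし', 'た', '。']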
In spaCy, the process of tokenizing a text into segments of words and punctuation is done in several steps, processing the text from left to right. First, the tokenizer splits the text on whitespace, similar to the split() function. Then, for each substring, it checks whether the substring matches a tokenizer exception rule or whether a prefix or suffix can be split off.
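You can see those steps on an English example. This sketch assumes spaCy v2.2 or later, where Tokenizer.explain is available:
import spacy

nlp = spacy.blank("en")                  # blank pipeline: tokenizer only
text = "Let's go to N.Y.!"
print(text.split())                      # plain whitespace split
print([t.text for t in nlp(text)])       # spaCy's rule-based tokenizer
# Tokenizer.explain reports which rule (prefix, suffix, special case, ...) produced each token
for rule, token_text in nlp.tokenizer.explain(text):
    print(token_text, rule)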
While NLTK gives you access to many alternative algorithms for each task, spaCy aims to provide one well-tuned implementation. Its developers describe it as offering the fastest and most accurate syntactic analysis of any NLP library released to date, and it also provides access to larger word vectors that are easier to customize.
According to spaCy, tokenization for Japanese is still in an alpha phase. The ideal way to tokenize is to provide a tokenized word list together with information about the language structure. For example, for an English sentence you can try this:
import spacy
nlp = spacy.load("en") # execute "python -m spacy download en" before this on standard console
sentence = "Writing some answer on stackoverflow, as an example for spacy language model"
print(["::".join((word.orth_, word.pos_)) for word in nlp(sentence)])
## <OUTPUT>
## ['Writing::VERB', 'some::DET', 'answer::NOUN', 'on::ADP', 'stackoverflow::NOUN', ',::PUNCT', 'as::ADP', 'an::DET', 'example::NOUN', 'for::ADP', 'spacy::ADJ', 'language::NOUN', 'model::NOUN']
Such results are not yet available for Japanese.
If you run python -m spacy download xx and then use nlp = spacy.load("xx"), it tries its best to recognize named entities.
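For example (assuming the multi-language model has been downloaded as described above):
import spacy

nlp = spacy.load("xx")    # multi-language model; it mainly provides named entity recognition
doc = nlp("Google was founded in California by Larry Page and Sergey Brin.")
print([(ent.text, ent.label_) for ent in doc.ents])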
Also, if you look at the spaCy source here, you will see that tokenization is available, but it only exposes a make_doc function, which is quite naive.
Note: the pip version of spaCy still contains older code; the GitHub link above is somewhat closer to the latest code.
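To see what make_doc gives you in practice, you can call it directly; it only segments the text and attaches no tags, parses or entities. This sketch assumes the Japanese language data and its tokenizer backend (janome in older versions, SudachiPy in newer ones) are installed:
import spacy

nlp = spacy.blank("ja")    # Japanese language data only, no trained model
doc = nlp.make_doc("すぺいんへ いきました。")
print([t.text for t in doc])   # just the segmented tokens, nothing else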
So for building a tokenization right now, it is highly suggested to use janome. An example is given below:
from janome.tokenizer import Tokenizer as janome_tokenizer
sentence = "日本人のものと見られる、延べ2億件のメールアドレスとパスワードが闇サイトで販売されていたことがわかりました。過去に漏えいしたデータを集めたものと見られ、調査に当たったセキュリティー企業は、日本を狙ったサイバー攻撃のきっかけになるおそれがあるとして注意を呼びかけています。"
token_object = janome_tokenizer()
[x.surface for x in token_object.tokenize(sentence)]
## <OUTPUT> ##
## ['日本人', 'の', 'もの', 'と', '見', 'られる', '、', '延べ', '2', '億', '件', 'の', 'メールアドレス', 'と', 'パスワード', 'が', '闇', 'サイト', 'で', '販売', 'さ', 'れ', 'て', 'い', 'た', 'こと', 'が', 'わかり', 'まし', 'た', '。', '過去', 'に', '漏えい', 'し', 'た', 'データ', 'を', '集め', 'た', 'もの', 'と', '見', 'られ', '、', '調査', 'に', '当たっ', 'た', 'セキュリティー', '企業', 'は', '、', '日本', 'を', '狙っ', 'た', 'サイバー', '攻撃', 'の', 'きっかけ', 'に', 'なる', 'お', 'それ', 'が', 'ある', 'として', '注意', 'を', '呼びかけ', 'て', 'い', 'ます', '。']
## you can also inspect each token's full info with:
## for x in token_object.tokenize(sentence):
## print(x)
## <OUTPUT LIKE>:
## 日本人 名詞,一般,*,*,*,*,日本人,ニッポンジン,ニッポンジン
## の 助詞,連体化,*,*,*,*,の,ノ,ノ
## もの 名詞,非自立,一般,*,*,*,もの,モノ,モノ
## と 助詞,格助詞,引用,*,*,*,と,ト,ト
## ....
## <OUTPUT Truncated>
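If you want to get these pre-tokenized words back into spaCy, one option is to build a Doc from the word list yourself. A minimal sketch (spaces=False for every token, since Japanese writes no whitespace between words):
from janome.tokenizer import Tokenizer as janome_tokenizer
from spacy.tokens import Doc
from spacy.vocab import Vocab

sentence = "すぺいんへいきました。"
words = [x.surface for x in janome_tokenizer().tokenize(sentence)]
# build a spaCy Doc directly from the pre-tokenized words
doc = Doc(Vocab(), words=words, spaces=[False] * len(words))
print([t.text for t in doc])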
I think the spaCy team is working on similar output in order to build models for Japanese, so that "language specific" constructs become available for Japanese as they are for other languages.
Update
After writing this, out of curiosity I started searching around. Please check UDPipe here, here & here. It seems UDPipe supports more than 50 languages, and it offers a solution to the language-support problem we see in spaCy.
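As a rough sketch, the official Python bindings are in the ufal.udpipe package; the model file name below is just a placeholder for whichever pretrained Japanese model you download from the UDPipe site:
from ufal.udpipe import Model, Pipeline

# placeholder file name: download a pretrained Japanese model from the UDPipe site first
model = Model.load("japanese-ud.udpipe")
pipeline = Pipeline(model, "tokenize", Pipeline.DEFAULT, Pipeline.DEFAULT, "conllu")
print(pipeline.process("すぺいんへ いきました。"))   # CoNLL-U output, one token per line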
Try using this:
import spacy

nlp = spacy.blank("ja")       # the nlp pipeline must be created before it can be called
question = u'すぺいんへ いきました。'
nlp(question)