
Spacy Japanese Tokenizer

I am trying to use spaCy's Japanese tokenizer:

import spacy
Question= 'すぺいんへ いきました。'
nlp(Question.decode('utf8'))

I am getting the error below:

TypeError: Expected unicode, got spacy.tokens.token.Token

Any ideas on how to fix this?

Thanks!

asked Nov 01 '17 by AKSHAYAA VAIDYANATHAN

People also ask

How do you Tokenize in Japanese?

However, in Japanese, words are normally written without any spaces between them. Japanese tokenization therefore requires reading and analyzing the whole sentence, recognizing words, and determining word boundaries without any explicit delimiters. Most Japanese tokenizers use lattice-based tokenization.
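For example, naive whitespace splitting gets nowhere on unspaced Japanese, while a dictionary-based tokenizer recovers the word boundaries. A minimal sketch using janome (the library the accepted answer below recommends; the sample sentence is my own):

## Sketch: whitespace splitting vs. dictionary/lattice-based tokenization.
## Assumes janome is installed (pip install janome).
from janome.tokenizer import Tokenizer

text = u"すもももももももものうち"  # a well-known unspaced tongue twister
print(text.split())               # one unusable chunk: [u'すもももももももものうち']
print([t.surface for t in Tokenizer().tokenize(text)])
## <OUTPUT>
## ['すもも', 'も', 'もも', 'も', 'もも', 'の', 'うち']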

How do you use a spaCy Tokenizer?

In spaCy, tokenizing a text into segments of words and punctuation is done in several steps. The tokenizer processes the text from left to right. First, it splits the text on whitespace, similar to the split() function. Then it checks whether each substring matches a tokenizer exception rule.
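A minimal sketch of those steps on English text; spacy.blank is spaCy v2+ API (newer than the answers below) and creates a tokenizer-only pipeline:

import spacy

nlp = spacy.blank("en")  # tokenizer-only pipeline (spaCy v2+)
doc = nlp("Don't visit the U.K. now!")
print([t.text for t in doc])
## <OUTPUT>
## ['Do', "n't", 'visit', 'the', 'U.K.', 'now', '!']
## Whitespace alone would keep "Don't" together; the exception rules split it
## into 'Do' + "n't" and keep 'U.K.' as a single token.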

Which is better, NLTK or spaCy?

While NLTK provides access to many algorithms to get something done, spaCy provides the best way to do it. It provides the fastest and most accurate syntactic analysis of any NLP library released to date. It also offers access to larger word vectors that are easier to customize.
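A hedged side-by-side sketch of that difference (assumes NLTK with its punkt tokenizer data downloaded, and spaCy v2+ for spacy.blank):

import nltk
import spacy

nltk.download("punkt", quiet=True)      # tokenizer data used by word_tokenize
sentence = "NLTK and spaCy tokenize text differently."
print(nltk.word_tokenize(sentence))     # one of several tokenizers NLTK offers
nlp = spacy.blank("en")
print([t.text for t in nlp(sentence)])  # spaCy's single built-in tokenizer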


2 Answers

According to spaCy, tokenization for Japanese is still in an alpha phase. Ideally, tokenization should provide a list of tokenized words along with information about the language structure. For example, for an English sentence you can try this:

import spacy
nlp = spacy.load("en") # execute "python -m spacy download en" before this on standard console
sentence = "Writing some answer on stackoverflow, as an example for spacy language model"
print(["::".join((word.orth_, word.pos_)) for word in nlp(sentence)])
## <OUTPUT>
## ['Writing::VERB', 'some::DET', 'answer::NOUN', 'on::ADP', 'stackoverflow::NOUN', ',::PUNCT', 'as::ADP', 'an::DET', 'example::NOUN', 'for::ADP', 'spacy::ADJ', 'language::NOUN', 'model::NOUN']

Such results are not currently available for Japanese. If you run python -m spacy download xx and use nlp = spacy.load("xx"), it tries its best to recognize named entities, but it produces nothing like the tagged output above.
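A sketch of that multilingual fallback; in spaCy v2 the xx shortcut resolves to the multilingual NER model xx_ent_wiki_sm (an assumption about your install):

import spacy

## Assumes "python -m spacy download xx" has been run (installs xx_ent_wiki_sm).
nlp = spacy.load("xx")
doc = nlp(u'すぺいんへ いきました。')  # "I went to Spain."
print([(ent.text, ent.label_) for ent in doc.ents])
## Only NER is available; the entity list may well be empty for this sentence,
## and no POS tags like the English output above are produced.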

Also, if you look at the spaCy source here, you will see that tokenization is available, but it exposes only a make_doc function, which is quite naive. Note: the pip version of spaCy still ships older code; the GitHub link above is somewhat ahead of it.
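You can see how little make_doc gives you by calling it directly. A sketch assuming a spaCy version whose Japanese class (spacy.lang.ja.Japanese) is importable and has a tokenizer backend such as janome installed:

from spacy.lang.ja import Japanese  # alpha-stage language class

nlp = Japanese()                    # raises if no Japanese backend is installed
doc = nlp.make_doc(u'すぺいんへ いきました。')
print([t.text for t in doc])        # surface tokens only: no POS, no parse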

So for tokenization, it is highly suggested as of now to use janome. An example is given below:

from janome.tokenizer import Tokenizer as janome_tokenizer
sentence = "日本人のものと見られる、延べ2億件のメールアドレスとパスワードが闇サイトで販売されていたことがわかりました。過去に漏えいしたデータを集めたものと見られ、調査に当たったセキュリティー企業は、日本を狙ったサイバー攻撃のきっかけになるおそれがあるとして注意を呼びかけています。"
token_object = janome_tokenizer()
[x.surface for x in token_object.tokenize(sentence)]
## <OUTPUT> ##
## ['日本人', 'の', 'もの', 'と', '見', 'られる', '、', '延べ', '2', '億', '件', 'の', 'メールアドレス', 'と', 'パスワード', 'が', '闇', 'サイト', 'で', '販売', 'さ', 'れ', 'て', 'い', 'た', 'こと', 'が', 'わかり', 'まし', 'た', '。', '過去', 'に', '漏えい', 'し', 'た', 'データ', 'を', '集め', 'た', 'もの', 'と', '見', 'られ', '、', '調査', 'に', '当たっ', 'た', 'セキュリティー', '企業', 'は', '、', '日本', 'を', '狙っ', 'た', 'サイバー', '攻撃', 'の', 'きっかけ', 'に', 'なる', 'お', 'それ', 'が', 'ある', 'として', '注意', 'を', '呼びかけ', 'て', 'い', 'ます', '。']
## you can look at
## for x in token_object.tokenize(sentence):
##     print(x)
## <OUTPUT LIKE>:
## 日本人    名詞,一般,*,*,*,*,日本人,ニッポンジン,ニッポンジン
## の        助詞,連体化,*,*,*,*,の,ノ,ノ
## もの      名詞,非自立,一般,*,*,*,もの,モノ,モノ
## と        助詞,格助詞,引用,*,*,*,と,ト,ト
## ....
## <OUTPUT Truncated>

I think the spaCy team is working on similar output in order to build models for Japanese, so that "language specific" constructs can be created for Japanese as well, as they already exist for other languages.
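In the meantime, you can hand janome's tokens to spaCy yourself. Doc(vocab, words=...) is standard spaCy API for building a document from pre-tokenized words; pairing it with janome here is my own illustration, not something the answer above shows:

import spacy
from spacy.tokens import Doc
from janome.tokenizer import Tokenizer as janome_tokenizer

nlp = spacy.blank("xx")       # any vocab works for holding plain tokens
janome = janome_tokenizer()

def make_ja_doc(text):
    ## Build a spaCy Doc from janome's surface forms, bypassing spaCy's tokenizer.
    words = [x.surface for x in janome.tokenize(text)]
    return Doc(nlp.vocab, words=words)

doc = make_ja_doc(u'すぺいんへ いきました。')
print([t.text for t in doc])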

Update

After writing this, out of curiosity, I started to search around. Please check udpipe here, here & here. It seems udpipe supports more than 50 languages, and as far as language support is concerned it provides a solution to the problem we see in spaCy.
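For reference, the ufal.udpipe Python bindings are driven roughly like this; the model file name below is a placeholder, since trained models are downloaded separately:

from ufal.udpipe import Model, Pipeline

## "japanese-gsd.udpipe" is a placeholder; download a trained model first.
model = Model.load("japanese-gsd.udpipe")
pipeline = Pipeline(model, "tokenize", Pipeline.DEFAULT, Pipeline.DEFAULT, "conllu")
print(pipeline.process(u'すぺいんへ いきました。'))  # CoNLL-U, one token per line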

answered by Anugraha Sinha


Try passing a unicode string directly instead of calling .decode():

import spacy

# In Python 2, 'すぺいんへ いきました。' is a byte string; the u'' prefix makes it
# a unicode string, which is what spaCy expects.
question = u'すぺいんへ いきました。'
nlp(question)

answered by Bhushan Pant