I'm using fugashi to extract words from sentences. How do I add new terms that are not in the fugashi dictionary to the dictionary?
For example, ユーチューブ ("YouTube") is split into ユー ("You") and チューブ ("Tube"):
import fugashi

tagger = fugashi.Tagger()
nodes = tagger.parseToNodeList("ユーチューブ")
goodpos = ['名詞']  # keep nouns only
words = [nn.surface for nn in nodes if nn.feature.pos1 in goodpos]
print(words)
# => ['ユー', 'チューブ']
I haven't gotten around to making a proper guide for this yet. Basically you should follow the MeCab docs on user dictionaries, using fugashi-build-dict in place of mecab-dict-index.
To give brief instructions, first you need to make a CSV file that uses the same format as your system dictionary. The example entries below are based on unidic-lite.
令和,4786,4786,8205,名詞,固有名詞,一般,*,*,*,レイワ,令和,令和,レーワ,令和,レーワ,固,*,*,*,*,*,*,*,レイワ,レイワ,レイワ,レイワ,"1,0",*,*,*,*
㋿,5969,5969,2588,補助記号,一般,*,*,*,*,,㋿,㋿,,㋿,,記号,*,*,*,*,*,*,*,,,,,*,*,*,*,999999
㋿,4786,4786,3992,名詞,固有名詞,一般,*,*,*,レイワ,令和,㋿,レーワ,㋿,レーワ,固,*,*,*,*,*,*,*,レイワ,レイワ,レイワ,レイワ,"1,0",*,*,*,*
夢夢,4786,4786,8205,名詞,固有名詞,一般,*,*,*,レイワ,令和,令和,レーワ,令和,レーワ,固,*,*,*,*,*,*,*,レイワ,レイワ,レイワ,レイワ,"1,0",*,*,*,*
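Rows like these can also be generated with a short script. The following is only a minimal sketch: it copies the 令和 entry shown above and swaps in a new surface form and readings. The helper names make_entry and to_csv_line are hypothetical, and reusing the template's connection IDs, cost, and accent field ("1,0") for a new word is an assumption that may need tuning for your dictionary.

```python
import csv
import io

# Template row copied verbatim from the unidic-lite style example above.
TEMPLATE = (
    '令和,4786,4786,8205,名詞,固有名詞,一般,*,*,*,レイワ,令和,令和,レーワ,令和,レーワ,'
    '固,*,*,*,*,*,*,*,レイワ,レイワ,レイワ,レイワ,"1,0",*,*,*,*'
)


def make_entry(surface, kana, pron, template=TEMPLATE):
    """Copy the template row, replacing its surface/lemma and reading fields.

    Assumption: every field equal to 令和, レイワ or レーワ in the template
    should be replaced; all other fields (IDs, cost, POS) are kept as-is.
    """
    fields = next(csv.reader([template]))
    swaps = {"令和": surface, "レイワ": kana, "レーワ": pron}
    return [swaps.get(field, field) for field in fields]


def to_csv_line(row):
    """Serialize one row; the csv module re-quotes fields containing commas."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow(row)
    return buf.getvalue().rstrip("\n")


row = make_entry("ユーチューブ", "ユーチューブ", "ユーチューブ")
print(to_csv_line(row))
```

Writing the printed line into mydic.csv (UTF-8) gives an entry in the same 33-column layout as the examples. Parsing with the csv module rather than splitting on commas matters here because the accent field "1,0" contains a comma.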
You can make this by copying entries from the UniDic source and editing fields. Then you run this command:
fugashi-build-dict -d dicdir/ -u mydic.dic mydic.csv
Here dicdir is the location of your system dictionary and mydic.csv is the CSV file you made. This will create the mydic.dic file, which you can then use with fugashi by passing -u mydic.dic as a Tagger argument.
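Put together, the build and load steps might look like the sketch below. It assumes fugashi and a system dictionary are installed and that fugashi-build-dict is on your PATH. The function names are hypothetical, and the paths dicdir/, mydic.csv and mydic.dic are the placeholders from the command above.

```python
import subprocess


def build_command(dicdir, csv_path, out_path):
    """Return the fugashi-build-dict invocation described above as an argv list."""
    return ["fugashi-build-dict", "-d", dicdir, "-u", out_path, csv_path]


def load_tagger(user_dic):
    """Create a Tagger that also consults the compiled user dictionary."""
    # Imported lazily so the command helper works without fugashi installed.
    import fugashi

    return fugashi.Tagger(f"-u {user_dic}")


def extract_nouns(tagger, text):
    """Same noun filter as in the question."""
    nodes = tagger.parseToNodeList(text)
    return [node.surface for node in nodes if node.feature.pos1 == "名詞"]


def rebuild_and_extract(text, dicdir="dicdir/", csv_path="mydic.csv",
                        out_path="mydic.dic"):
    """Compile the CSV into a user dictionary, then tokenize with it."""
    subprocess.run(build_command(dicdir, csv_path, out_path), check=True)
    return extract_nouns(load_tagger(out_path), text)
```

Calling rebuild_and_extract("ユーチューブ") after adding a ユーチューブ row to mydic.csv should return the word as a single token, provided the entry's cost is low enough to beat the ユー + チューブ split. If you use unidic-lite, the system dictionary directory is typically available as unidic_lite.DICDIR.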