
How can I use Python NLTK to identify collocations among single characters?

I want to use NLTK to identify collocations among particular kanji characters in Japanese and hanzi characters in Chinese. As with word collocations, some sequences of Chinese characters are far more likely than others. Example: Many words in Chinese and Japanese are two-character bigrams — character A and character B (e.g. 日本 = Japan, ni-hon in Japanese and ri-ben in Chinese). Given character A (日), it is much more likely that 本 will appear as character B. So the characters 日 and 本 are collocates.

I want to use NLTK to find out answers to these questions:

(1) Given character A, what characters are most likely to be character B?

(2) Given character B, what characters are most likely to be character A?

(3) How likely are character A and character B to appear together in a sentence, even if they do not appear side-by-side?
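
For (1) and (2), my best guess so far is a plain ConditionalFreqDist over character bigrams, one forward and one reversed, since iterating over a Python string yields single characters. An untested sketch with a toy string standing in for a real corpus:

import nltk

# Untested sketch: answering (1) and (2) with conditional frequency
# distributions over character bigrams. The toy string below is a
# placeholder for a real corpus.
sample_text = '日本語の本は日本で買った。'
cfd_forward = nltk.ConditionalFreqDist(nltk.bigrams(sample_text))
cfd_backward = nltk.ConditionalFreqDist((b, a) for a, b in nltk.bigrams(sample_text))

print(cfd_forward['日'].most_common(5))   # (1): most likely B's given A = 日
print(cfd_backward['本'].most_common(5))  # (2): most likely A's given B = 本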

Relatedly: if I have a frequency list of kanji/hanzi, can I force the NLTK collocations module to examine only relations among the kanji/hanzi in my list, ignoring all other characters? This would exclude results in which single Roman letters (a, b, c, etc.) or punctuation marks are treated as possible collocates.
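
If the finders treat each character as a "word" (see the update below), then apply_word_filter with a whitelist might be all that's needed. An untested sketch, with kanji_whitelist as a placeholder for my real frequency list:

from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

# kanji_whitelist is a placeholder for my real frequency list.
kanji_whitelist = {'日', '本', '語'}
finder = BigramCollocationFinder.from_words('日本語abc、日本。')
# Discard any bigram containing a character outside the whitelist.
finder.apply_word_filter(lambda ch: ch not in kanji_whitelist)
print(finder.nbest(BigramAssocMeasures.likelihood_ratio, 5))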

Unfortunately, the documentation, HOWTO, and source code for nltk.collocations, as well as the NLTK Book, only discuss English NLP and understandably do not address the question of single-character collocations. Functions in the nltk.collocations module seem to have a word tokenizer built in, so I think they ignore single characters by default.

UPDATE: The following code seems to be on the right track:

import nltk
from nltk.collocations import BigramCollocationFinder

def main():
    scorer = nltk.collocations.BigramAssocMeasures.likelihood_ratio
    with open('sample_jp_text.txt', mode='r') as infile:
        sample_text = infile.read()
    # from_words iterates over the string, so each character is a "word".
    finder = BigramCollocationFinder.from_words(sample_text, window_size=13)
    print('\t', [' '.join(tup) for tup in finder.nbest(scorer, 15)])

if __name__ == '__main__':
    main()

Results:

 ['リ ザ', 'フ ザ', 'フ リ', '0 0', '悟 空', 'リ ー', 'ー ザ', '億 0', '2 0', 'サ ヤ', '0 万', 'サ 人', '0 円', '復 活', '。 \n']

So BigramCollocationFinder.from_words is treating the individual characters in my Japanese sample text as candidates for bigram collocations, presumably because iterating over a string yields single characters. I'm still not sure how to take the next step from this result to answering the questions posed above.
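
For question (3), window_size only captures characters that appear near each other, so my current plan is to count sentence-level co-occurrence directly, splitting on '。'. A rough, untested sketch:

from collections import Counter
from itertools import combinations

with open('sample_jp_text.txt', mode='r') as infile:
    text = infile.read()

pair_counts = Counter()
for sentence in text.split('。'):
    # Count each character at most once per sentence; sorting fixes the
    # pair order so (A, B) and (B, A) are tallied together.
    chars = sorted(set(sentence))
    pair_counts.update(combinations(chars, 2))

print(pair_counts.most_common(15))  # (3): characters that share sentences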

asked Apr 23 '17 by WordBrewery

1 Answer

Most probably you're not stuck on the n-gram part of the task, but on how to clean your data so that you get kanji words out of the mess of other characters.

Here's a hack, but it'll require the charguana library:

# -*- coding: utf-8 -*-

from string import punctuation
# Older version:
#from charguana.cjk import get_charset
from charguana import get_charset


hiragana = list(get_charset('hiragana'))
katakana = list(get_charset('katakana'))
cjk_punctuations = list(get_charset('punctuation'))
romanji = list(get_charset('romanji'))

mccarl_stoplist = ['"', '#', '$', '%', '&', "'", '(', ')', '*', '+', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '=', '>', '?', '@', '[', ']', '^', '_', '`', 'a', 'A', 'b', 'B', 'c', 'C', 'd', 'D', 'e', 'E', 'F', 'f', 'g', 'G', 'h', 'H', 'i', 'I', 'j', 'J', 'k', 'K', 'l', 'L', 'm', 'M', 'n', 'N', 'o', 'O', 'p', 'P', 'q', 'Q', 'r', 'R', 's', 'S', 't', 'T', 'u', 'U', 'v', 'V', 'w', 'W', 'x', 'X', 'y', 'Y', 'z', 'Z', '{', '|', '}', '~', ' ', '£', '§', '®', '°', '±', '²', '´', '¿', '×', 'ß', 'ẞ', 'Á', 'á', 'â', 'ã', 'ä', 'ç', 'è', 'é', 'É', 'ê', 'í', 'î', 'ï', 'ñ', 'Ñ', 'ó', 'Ó', 'ô', 'Ö', 'ö', '÷', 'ú', 'ü', 'Ü', 'ý', 'þ', 'ā', 'Ā', 'ć', 'č', 'Č', 'Ď', 'ě', 'ī', 'ı', 'ł', 'ń', 'ś', 'ť', 'Ż', 'Ž', 'ƛ', 'ɱ', 'ʏ', 'ʒ', 'ʻ', 'ʿ', '˚', '́', '̃', 'ί', 'α', 'β', 'Δ', 'ε', 'ζ', 'θ', 'λ', 'μ', 'ν', 'ξ', 'ο', 'π', 'ς', 'Σ', 'σ', 'ω', 'Ω', 'а', 'А', 'б', 'Б', 'В', 'в', 'Г', 'г', 'д', 'е', 'Е', 'ж', 'з', 'З', 'и', 'И', 'й', 'К', 'к', 'л', 'Л', 'М', 'м', 'н', 'О', 'о', 'П', 'п', 'Р', 'р', 'с', 'С', 'т', 'у', 'Ф', 'ф', 'х', 'ц', 'Ч', 'ч', 'ъ', 'ы', 'ь', 'Э', 'я', 'Я', 'ђ', 'Ә', 'Ի', 'ո', 'ا', 'ر', 'ع', 'ك', 'م', 'و', 'ُ', 'ٹ', 'ٽ', 'ڪ', 'ܕ', 'अ', 'ट', 'ड', 'त', 'थ', 'न', 'म', 'ल', 'व', 'श', 'ा', 'ी', 'े', 'ो', '्', 'ই', 'গ', 'ধ', 'ব', 'ল', 'শ', 'ে', 'ਬ', 'ભ', 'ી', 'ଭ', 'த', 'ట', 'ర', 'స', 'ಟ', 'ോ', 'හ', 'ง', 'ย', 'ห', 'ิ', 'ู', 'ເ', 'ແ', 'ང', 'ཆ', 'ོ', 'ྩ', 'န', 'း', 'პ', 'წ', 'ለ', 'ማ', 'ቱ', 'ክ', 'ደ', 'ខ', 'ឹ', 'ḡ', 'Ḫ', 'ḻ', 'ṁ', 'ṃ', 'Ẑ', 'ễ', 'ỉ', 'ự', 'Ὡ', 'ῶ', '‐', '–', '—', '―', '‘', '’', '“', '”', '†', '‥', '…', '′', '※', '₣', '℃', 'ℓ', '←', '↑', '→', '↓', '⇒', '⇔', '∃', '∈', '−', '∗', '∞', '∴', '≈', '≒', '≠', '≡', '≥', '⎱', '␏', '␡', '①', '②', '③', '④', '⑤', '⑰', '─', '━', '┃', '┛', '┫', '╳', '■', '□', '▪', '▲', '△', '▼', '▽', '○', '◎', '★', '☆', '☓', '♂', '♡', '♢', '♣', '♥', '♪', '♭', '✕', '✖', '❝', 'ⵃ', '⺌', '⺕', '⺮', '⺼', '⻌', '⻎', '\u3000', '、', '。', '〃', '〆', '〇', '〈', '〉', '《', '》', '「', '」', '『', '』', '【', '】', '〒', '〓', '〔', '〕', '〜', '〡', '〳', '〴', '〵', '〻', 'ゎ', 'ゐ', 'ゑ', 'ゔ', 'ゕ', 'ゖ', '゙', '゛', '゜', 'ゝ', 'ゞ', 'ゟ', 'ヮ', 'ヷ', 'ヸ', 'ヹ', 'ヺ', '・', 'ー', 'ヽ', 'ヾ', 'ヿ', 'ㇰ', 'ㇱ', 'ㇲ', 'ㇳ', 'ㇴ', 'ㇵ', 'ㇶ', 'ㇷ', 'ㇸ', 'ㇹ', 'ㇺ', 'ㇻ', 'ㇼ', 'ㇽ', 'ㇾ', 'ㇿ', '㋖', '㋚', '㋡', '㋣', '㋨', '㋪', '㋮', '㋲', '㋹', '㌔', '㌘', '㌢', '㌣', '㌦', '㌧', '㌫', '㌻', '㍉', '㍍', '㍑', '㎞', '㎡', '㎥', '㐅', '나', '딜', '르', '림', '만', '메', '문', '뮤', '약', '오', '왕', '인', '입', '쟁', '정', '펜', '항', '했', '형', '화', '훈', '艹', '辶', '!', '$', '%', '&', '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '=', '>', '?', '@', '[', ']', '^', '_', '`', 'A', 'a', 'b', 'B', 'c', 'C', 'd', 'D', 'e', 'E', 'F', 'g', 'G', 'h', 'H', 'I', 'i', 'j', 'J', 'k', 'K', 'L', 'l', 'm', 'M', 'n', 'N', 'O', 'o', 'p', 'P', 'Q', 'r', 'R', 'S', 's', 'T', 't', 'U', 'u', 'v', 'V', 'w', 'W', 'X', 'x', 'Y', 'y', 'Z', 'z', '{', '|', '}', '~', '。', '「', '」', '、', '・', 'ヲ', 'ァ', 'ィ', 'ゥ', 'ェ', 'ォ', 'ャ', 'ュ', 'ョ', 'ッ', 'ー', 'ア', 'イ', 'ウ', 'エ', 'オ', 'カ', 'キ', 'ク', 'ケ', 'コ', 'サ', 'シ', 'ス', 'セ', 'ソ', 'タ', 'チ', 'ツ', 'テ', 'ト', 'ナ', 'ニ', 'ヌ', 'ネ', 'ノ', 'ハ', 'ヒ', 'フ', 'ヘ', 'ホ', 'マ', 'ミ', 'ム', 'メ', 'モ', 'ヤ', 'ユ', 'ヨ', 'ラ', 'リ', 'ル', 'レ', 'ロ', 'ワ', 'ン', '゙', '゚', '£', ' ̄', '¥', '', '𛀀', '𛀁', '👉', '𠀋', '𠀧', '𠁠', '𠃉', '𠃊', '𠃋', '𠃌', '𠃍', '𠃎', '𠃏', '𠃐', '𠃑', '𠃒', '𠃓', '𠃔', '𠃕', '𠃖', '𠃗', '𠃘', '𠃙', '𠃚', '𠃛', '𠃜', '𠃝', '𠃞', '𠃟', '𠃠', '𠃡', '𠃢', '𠃣', '𠃤', '𠃥', '𠃦', '𠃧', '𠃨', '𠃩', '𠃪', '𠃫', '𠃬', '𠃭', '𠃯', '𠃰', '𠃱', '𠃲', '𠃳', '𠃴', '𠄇', '𠄈', '𠄉', '𠄊', '𠄎', '𠄔', '𠄕', '𠄙', '𠄧', '𠄩', '𠄶', '𠄻', '𠄼', '𠅨', '𠆢', '𠆭', 
'𠇇', '𠇗', '𠊛', '𠍋', '𠓿', '𠔻', '𠘗', '𠘡', '𠘤', '𠘥', '𠘦', '𠦤', '𠦴', '𠧈', '𠧉', '𠧟', '𠨡', '𠪳', '𠫑', '𠫓', '𠬠', '𡅦', '𡆓', '𡆟', '𡆢', '𡍢', '𡎴', '𡒉', '𡔖', '𡗅', '𡗐', '𡘯', '𡙧', '𡚥', '𡤓', '𡦹', '𡧑', '𡨧', '𡨸', '𡮲', '𡳾', '𡴫', '𡶛', '𡸇', '𡸕', '𢀓', '𢀳', '𢁋', '𢁐', '𢁑', '𢁒', '𢁓', '𢄂', '𢆅', '𢆇', '𢆥', '𢆯', '𢆲', '𢆴', '𢇓', '𢇖', '𢈴', '𢍰', '𢍺', '𢎒', '𢎕', '𢎖', '𢎗', '𢎘', '𢎙', '𢎜', '𢎞', '𢎥', '𢎧', '𢎵', '𢑎', '𢑽', '𢖟', '𢘑', '𢘥', '𢙯', '𢦒', '𢨋', '𢨣', '𢩢', '𢶏', '𢷋', '𢾺', '𣁔', '𣁕', '𣁖', '𣁫', '𣂑', '𣃥', '𣌒', '𣎳', '𣒱', '𣓏', '𣖾', '𣛦', '𣛧', '𣛭', '𣜬', '𣠰', '𣠶', '𣡕', '𣡽', '𣰰', '𣱹', '𣲧', '𣴓', '𣶒', '𣷚', '𣷭', '𣹹', '𤅳', '𤅴', '𤆀', '𤆁', '𤆼', '𤇜', '𤊞', '𤊟', '𤌇', '𤌤', '𤐤', '𤓬', '𤕌', '𤕓', '𤘓', '𤘽', '𤙭', '𤟹', '𤪠', '𤰃', '𤰌', '𤰑', '𤰓', '𤰞', '𤳆', '𤴐', '𤴑', '𤴔', '𤼲', '𤼵', '𥂕', '𥐓', '𥒐', '𥜸', '𥜹', '𥜺', '𥝌', '𥤟', '𥤠', '𥤡', '𥥛', '𥯌', '𥲅', '𥶄', '𥸯', '𥻸', '𦇒', '𦉩', '𦓐', '𦓔', '𦓙', '𦓝', '𦓡', '𦓢', '𦘐', '𦚔', '𦝄', '𦠆', '𦨂', '𦨅', '𦮙', '𦹗', '𦻙', '𧈢', '𧊒', '𧏡', '𧒽', '𧘇', '𧢨', '𧢬', '𧢰', '𧢱', '𧥱', '𧫷', '𧰡', '𧲜', '𧲝', '𧲞', '𧲟', '𧴫', '𧶛', '𧾷', '𨁂', '𨈺', '𨋓', '𨐄', '𨐋', '𨐤', '𨐼', '𨑀', '𨑁', '𨒒', '𨜒', '𨡕', '𨣎', '𨤽', '𨮁', '𨰵', '𨰸', '𨰺', '𨰼', '𨱖', '𨱽', '𨷈', '𨽶', '𩁧', '𩃙', '𩅦', '𩇔', '𩎓', '𩏊', '𩏶', '𩏷', '𩏸', '𩏹', '𩑛', '𩖃', '𩙿', '𩠑', '𩠒', '𩠓', '𩠔', '𩠕', '𩠖', '𩠗', '𩠘', '𩠙', '𩠚', '𩠛', '𩠜', '𩠝', '𩠞', '𩠟', '𩠠', '𩠡', '𩠢', '𩠣', '𩠤', '𩠥', '𩠦', '𩠧', '𩠨', '𩠩', '𩠪', '𩠫', '𩠬', '𩠭', '𩠮', '𩠯', '𩠰', '𩠱', '𩠲', '𩠳', '𩠴', '𩠵', '𩠶', '𩠷', '𩠸', '𩠺', '𩠻', '𩠼', '𩠽', '𩠾', '𩠿', '𩡀', '𩡁', '𩡂', '𩡃', '𩡄', '𩡅', '𩡆', '𩡇', '𩡈', '𩡉', '𩡊', '𩡋', '𩡌', '𩡍', '𩡎', '𩡏', '𩡐', '𩡑', '𩡒', '𩡓', '𩡔', '𩡕', '𩡖', '𩡗', '𩡘', '𩡙', '𩡚', '𩡛', '𩡜', '𩡝', '𩡞', '𩡟', '𩡠', '𩡡', '𩡢', '𩡣', '𩡤', '𩡥', '𩡦', '𩥭', '𩰠', '𩰪', '𩲃', '𩲅', '𩳁', '𩳐', '𩵄', '𩵽', '𩷑', '𩸞', '𩹄', '𩹷', '𩿡', '𪆧', '𪉖', '𪊍', '𪊫', '𪔠', '𪕰', '𪙔', '𪙹', '𪙿', '𪚃', '𪚊', '𪚋', '𪚌', '𪚑', '𪚓', '𪚚', '𪚠', '𪚢', '𪚣', '𪚤', '𪚥', '𪜈', '𪫧', '𪷁', '𫗼', '𫗽', '𫗾', '𫗿', '𫘀', '𫘁', '𫘂', '𫘃', '𫛉', '𫠉', '𫠓', '馧']

stopwords = set(list(punctuation) + hiragana + katakana + cjk_punctuations + romanji + mccarl_stoplist)

with open('japanese_sample_text.txt') as fin:
    for line in fin:
        # Remove stopwords.
        characters = [char if char not in stopwords else '_' for char in line.strip()]
        words = [kanjiword for kanjiword in ''.join(characters).split('_') if kanjiword]
        if words:
            print(words)

[in]:

荒川支流である滝川の支流となっている。流路延長は5.0キロメートル、流域面積は9.8平方キロメートルである。流域は全て山地に属している。奥秩父を代表する沢登りスポットとなっている。流路にはホチの滝・トオの滝のほか、鍾乳洞「瀧谷洞」がある。昭和初期には原全教が「奥秩父」に豆焼川の紀行文を残している。

[out]:

['荒川支流', '滝川', '支流', '流路延長', '流域面積', '平方', '流域', '全', '山地', '属', '奥秩父', '代表', '沢登', '流路', '滝', '滝', '鍾乳洞', '瀧谷洞', '昭和初期', '原全教', '奥秩父', '豆焼川', '紀行文', '残']
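
From there you can flatten the kanji-only segments back into a character stream and hand it to the collocation finder, reusing the same stopwords set. A sketch (note that it also pairs the last character of one segment with the first of the next):

import nltk
from nltk.collocations import BigramCollocationFinder

# Collect every kanji segment from the file, as in the loop above.
all_words = []
with open('japanese_sample_text.txt') as fin:
    for line in fin:
        characters = [char if char not in stopwords else '_' for char in line.strip()]
        all_words += [w for w in ''.join(characters).split('_') if w]

# Score character bigrams over the kanji-only stream.
chars = [char for word in all_words for char in word]
finder = BigramCollocationFinder.from_words(chars)
print(finder.nbest(nltk.collocations.BigramAssocMeasures.likelihood_ratio, 15))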
answered Sep 28 '22 by alvas