Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

show duplicates from list in counting operation python

i have corpus_text with string of text, then i'm convert this to list with split on words

i need to count all of words, but my algorithm counting only unique

corpus_test = 'cat dog tiger tiger tiger cat dog lion'
corpus_test = [[word.lower() for word in corpus_test.split()]]
word_counts = defaultdict(int)
for rowt in corpus_test:
    for wordt in rowt:
        word_counts[wordt] += 1

        v_count = len(word_counts.keys())

        words_list = list(word_counts.keys())

        word_index = dict((word, i) for i, word in enumerate(words_list))

        index_word = dict((i, word) for i, word in enumerate(words_list))

and i want show you outputs from this algorithm

v_count
#4

words_list
#['cat', 'dog', 'tiger', 'lion']

word_counts
#defaultdict(int, {'cat': 2, 'dog': 2, 'tiger': 3, 'lion': 1})

word_index
#{'cat': 0, 'dog': 1, 'tiger': 2, 'lion': 3}

index_word
#{0: 'cat', 1: 'dog', 2: 'tiger', 3: 'lion'}

i need to have:

index_word
#{0: 'cat', 1: 'dog', 2: 'tiger', 3: 'tiger', 4: 'tiger', 5: 'cat', 6: 'dog', 7:'lion'}

and

v_count
#8
like image 799
germanjke Avatar asked Dec 13 '22 09:12

germanjke


2 Answers

If you want a map of indices to words, just... do that?

index_word = dict(enumerate(word.lower() for word in corpus_test.split()))

Or you have to store lists / sets of indices in your word_index, a dict is not a multimap, it maps a single key to a single value (though either can be composite).

Also word_counts could be a collection.Counter, it has useful features (like topN, or the ability to replicate / unfold items by their count).

like image 113
Masklinn Avatar answered Dec 24 '22 10:12

Masklinn


with the existing algorithm, you can try this.

index_word = dict((i, word) for i, word in enumerate(rowt)) 
v_count = len(index_word)
like image 26
Serkan Arslan Avatar answered Dec 24 '22 09:12

Serkan Arslan