
Generating n-grams from a string

I need to make a list of all 𝑛-grams of a string, beginning at the head of the string, for each integer 𝑛 from 1 to M, and then return a tuple of M such lists.

    def letter_n_gram_tuple(s, M):
        s = list(s)
        output = []
        for i in range(0, M+1):
            output.append(s[i:])
        return(tuple(output))

For letter_n_gram_tuple("abcd", 3), the output should be:

(['a', 'b', 'c', 'd'], ['ab', 'bc', 'cd'], ['abc', 'bcd'])

However, my output is:

(['a', 'b', 'c', 'd'], ['b', 'c', 'd'], ['c', 'd'], ['d'])

Should I use string slicing and then save the slices into the list?

Dari Obukhova asked Jan 27 '23 08:01


2 Answers

You can use nested for loops: the outer loop iterates over the n-gram length, and the inner loop slices the string at each start position.

def letter_n_gram_tuple(s, M):
    output = []

    for i in range(1, M + 1):
        gram = []
        for j in range(0, len(s)-i+1):
            gram.append(s[j:j+i])
        output.append(gram)

    return tuple(output)

Or build the same lists in one line with a nested list comprehension:

output = [[s[j:j+i] for j in range(0, len(s)-i+1)] for i in range(1, M + 1)]
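Note that the one-liner above builds a list of lists, while the question asks for a tuple. As a minimal sketch (reusing the question's function name, and assuming that an empty list is the desired result when the n-gram length exceeds the string length), the comprehension can be wrapped like this:

```python
def letter_n_gram_tuple(s, M):
    # Outer generator: n-gram length i. Inner comprehension: start index j.
    # When i > len(s), range() is empty, so that entry is an empty list.
    return tuple(
        [s[j:j + i] for j in range(len(s) - i + 1)]
        for i in range(1, M + 1)
    )


print(letter_n_gram_tuple("abcd", 3))
# (['a', 'b', 'c', 'd'], ['ab', 'bc', 'cd'], ['abc', 'bcd'])
```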

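Or use windowed from more_itertools. It yields tuples of characters, so join each window back into a string (this assumes M <= len(s); for a window longer than the string, windowed pads with None):

import more_itertools
output = [[''.join(w) for w in more_itertools.windowed(s, i)] for i in range(1, M + 1)]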

Test and output:

print(letter_n_gram_tuple("abcd", 3))
(['a', 'b', 'c', 'd'], ['ab', 'bc', 'cd'], ['abc', 'bcd'])
recnac answered Jan 28 '23 21:01


You need one more for loop to iterate over the start positions in the string:

def letter_n_gram_tuple(s, M):
    output = []
    for i in range(0, M):
        vals = [s[j:j+i+1] for j in range(len(s)) if len(s[j:j+i+1]) == i+1]
        output.append(vals)

    return tuple(output)

print(letter_n_gram_tuple("abcd", 3))

Output:

(['a', 'b', 'c', 'd'], ['ab', 'bc', 'cd'], ['abc', 'bcd'])
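The length check in the comprehension matters because slicing past the end of a string does not raise an error; it just returns a shorter piece. A small sketch of what happens with and without the filter (the variable names here are only for illustration):

```python
s = "abcd"
n = 3

# Without the length check, slices near the end of the string come out short.
unfiltered = [s[j:j + n] for j in range(len(s))]
print(unfiltered)  # ['abc', 'bcd', 'cd', 'd']

# Keeping only slices of exactly n characters leaves the real 3-grams.
filtered = [g for g in unfiltered if len(g) == n]
print(filtered)  # ['abc', 'bcd']
```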
Sociopath answered Jan 28 '23 22:01