
NLTK makes it easy to compute bigrams of words. What about letters?

I've seen tons of documentation all over the web about how the Python NLTK makes it easy to compute bigrams of words.

What about letters?

What I want to do is plug in a dictionary and have it tell me the relative frequencies of different letter pairs.

Ultimately I'd like to make some kind of markov process to generate likely-looking (but fake) words.

asked Jan 05 '13 by isthmuses


2 Answers

Here is an example (modulo Relative Frequency Distribution) using Counter from the collections module:

#!/usr/bin/env python

import sys
from collections import Counter
from itertools import islice
from pprint import pprint

def split_every(n, iterable):
    """Yield successive non-overlapping chunks of length n.

    Note: this counts non-overlapping pairs ('th', 'is', ...), not
    sliding-window bigrams; shift the input by one to get the rest.
    """
    i = iter(iterable)
    piece = ''.join(islice(i, n))
    while piece:
        yield piece
        piece = ''.join(islice(i, n))

def main(text):
    """ return ngrams for text """
    freqs = Counter()
    for pair in split_every(2, text): # adjust n here
        freqs[pair] += 1
    return freqs

if __name__ == '__main__':
    with open(sys.argv[1]) as handle:
        freqs = main(handle.read()) 
        pprint(freqs.most_common(10))

Usage:

$ python 14168601.py lorem.txt
[('t ', 32),
 (' e', 20),
 ('or', 18),
 ('at', 16),
 (' a', 14),
 (' i', 14),
 ('re', 14),
 ('e ', 14),
 ('in', 14),
 (' c', 12)]
answered Sep 19 '22 by miku


If bigrams are all you need, you don't need NLTK. You can simply do it as follows:

from collections import Counter
text = "This is some text"
# zip the text against itself shifted by one to pair each
# character with its successor, giving overlapping bigrams
bigrams = Counter(x + y for x, y in zip(*[text[i:] for i in range(2)]))
for bigram, count in bigrams.most_common():
    print(bigram, count)

Output:

is 2
s  2
me 1
om 1
te 1
 t 1
 i 1
e  1
 s 1
hi 1
so 1
ex 1
Th 1
xt 1
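The question's end goal, generating likely-looking fake words with a Markov process, follows directly from these bigram counts: pick each next letter with probability proportional to how often it follows the current one. A sketch under that assumption (the `build_model` and `generate` names are mine):

```python
import random
from collections import Counter, defaultdict

def build_model(words):
    """Map each letter to a Counter of the letters that follow it."""
    model = defaultdict(Counter)
    for word in words:
        for a, b in zip(word, word[1:]):
            model[a][b] += 1
    return model

def generate(model, start, length, seed=None):
    """Walk the bigram model from `start`, sampling followers by frequency."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        followers = model.get(out[-1])
        if not followers:  # dead end: no letter ever followed this one
            break
        letters, weights = zip(*followers.items())
        out.append(rng.choices(letters, weights=weights)[0])
    return ''.join(out)

model = build_model(["banana", "bandana"])
print(generate(model, "b", 6))
```

A real dictionary file as the training corpus (one word per line) gives far more convincing output than this two-word toy.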
answered Sep 21 '22 by vpekar