
Quick implementation of character n-grams for word

I wrote the following code to compute character bigrams; its output is shown right below it. My question is: how do I get output that excludes the trailing single character (i.e. 't')? And is there a quicker, more efficient way to compute character n-grams?

>>> b='student'
>>> y=[]
>>> for x in range(len(b)):
...     n=b[x:x+2]
...     y.append(n)
...
>>> y
['st', 'tu', 'ud', 'de', 'en', 'nt', 't']

Here is the result I would like to get: ['st', 'tu', 'ud', 'de', 'en', 'nt']

Thanks in advance for your suggestions.

Tiger1 asked Sep 06 '13 12:09


1 Answer

To generate bigrams:

In [8]: b='student'

In [9]: [b[i:i+2] for i in range(len(b)-1)]
Out[9]: ['st', 'tu', 'ud', 'de', 'en', 'nt']

To generalize to a different n:

In [10]: n=4

In [11]: [b[i:i+n] for i in range(len(b)-n+1)]
Out[11]: ['stud', 'tude', 'uden', 'dent']
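
The same slicing can be wrapped in a small helper so that it works for any n. This is only a minimal sketch: the function name char_ngrams, the default n=2, and the choice to return an empty list when n is longer than the word are assumptions made for illustration.

```python
def char_ngrams(word, n=2):
    """Return the character n-grams of `word` as a list of strings.

    Hypothetical helper: the name and the default n=2 are illustrative.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # Stop at len(word) - n + 1 so every slice is exactly n characters long.
    # When n > len(word) the range is empty, so the result is [].
    return [word[i:i + n] for i in range(len(word) - n + 1)]


print(char_ngrams("student"))     # bigrams
print(char_ngrams("student", 4))  # 4-grams
print(char_ngrams("hi", 3))       # word shorter than n
```

The trailing 't' in the question comes from the last loop iteration, where b[6:8] is silently cut short at the end of the string. Stopping the range at len(b) - n + 1 avoids that short slice.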
NPE answered Sep 21 '22 15:09