Generate ngrams with Julia

Question

To generate word bigrams in Julia, I could simply zip through the original list and a list that drops the first element, e.g.:

julia> s = split("the lazy fox jumps over the brown dog")
8-element Array{SubString{String},1}:
 "the"  
 "lazy" 
 "fox"  
 "jumps"
 "over" 
 "the"  
 "brown"
 "dog"  

julia> collect(zip(s, drop(s,1)))
7-element Array{Tuple{SubString{String},SubString{String}},1}:
 ("the","lazy")  
 ("lazy","fox")  
 ("fox","jumps") 
 ("jumps","over")
 ("over","the")  
 ("the","brown") 
 ("brown","dog")

To generate a trigram I could use the same collect(zip(...)) idiom to get:

julia> collect(zip(s, drop(s,1), drop(s,2)))
6-element Array{Tuple{SubString{String},SubString{String},SubString{String}},1}:
 ("the","lazy","fox")  
 ("lazy","fox","jumps")
 ("fox","jumps","over")
 ("jumps","over","the")
 ("over","the","brown")
 ("the","brown","dog")

But I have to manually add the third list to zip through. Is there an idiomatic way to generate n-grams of any order?

For example, I'd like to avoid writing this to extract 5-grams:

julia> collect(zip(s, drop(s,1), drop(s,2), drop(s,3), drop(s,4)))
4-element Array{Tuple{SubString{String},SubString{String},SubString{String},SubString{String},SubString{String}},1}:
 ("the","lazy","fox","jumps","over") 
 ("lazy","fox","jumps","over","the") 
 ("fox","jumps","over","the","brown")
 ("jumps","over","the","brown","dog")

Dan Getz · Accepted Answer

If the output may change slightly, using SubArrays (views) instead of Tuples loses little and avoids allocations and memory copying. This is fine as long as the underlying word list is static, since each view refers back to it, and it was also faster in my benchmarks. The code:

ngram(s,n) = [view(s,i:i+n-1) for i=1:length(s)-n+1]

and the output:

julia> ngram(s,5)
 SubString{String}["the","lazy","fox","jumps","over"] 
 SubString{String}["lazy","fox","jumps","over","the"] 
 SubString{String}["fox","jumps","over","the","brown"]
 SubString{String}["jumps","over","the","brown","dog"]

julia> ngram(s,5)[1][3]
"fox"
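The "static word list" caveat matters because each n-gram is a view into s rather than a copy. The following is a minimal sketch of that behavior, not part of the original answer; the names words, grams and snapshot are illustrative, and the word list is converted to plain Strings only to keep the assignments simple:

```julia
# Same definition as above; each element is a view into the word list, not a copy.
ngram(s, n) = [view(s, i:i+n-1) for i = 1:length(s)-n+1]

# Assumption for illustration: use a Vector{String} so elements can be reassigned directly.
words = String.(split("the lazy fox jumps over the brown dog"))
grams = ngram(words, 3)

first_before = join(grams[1], " ")   # trigram covering words[1:3]

# Mutating the underlying list changes every view that covers that index.
words[2] = "quick"
first_after = join(grams[1], " ")

# `copy` takes an independent snapshot of one n-gram when that is needed.
snapshot = copy(grams[2])
words[3] = "cat"                     # visible through grams[2], but not through snapshot
```

If the word list may be modified after the n-grams are built, copying the individual n-grams that need to survive is one way to decouple them, at the cost of the extra memory the views were avoiding.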

For larger word lists, the memory requirements are also substantially smaller.
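For comparison, if Tuples are required as in the question, the zip idiom can probably be generalized by building the shifted iterators in a generator and splatting them into zip. This is a sketch under assumptions rather than part of the original answer: ngram_tuples is a hypothetical name, n is assumed to be at least 1, and on Julia 1.x drop is spelled Iterators.drop:

```julia
# Hypothetical helper (not from the answer): generalizes the question's
# zip/drop idiom by creating the n shifted iterators programmatically.
# Assumes n >= 1. zip stops at the shortest iterator, i.e. the one dropped by n-1.
ngram_tuples(s, n) = collect(zip((Iterators.drop(s, k) for k in 0:n-1)...))

s = split("the lazy fox jumps over the brown dog")

bigrams = ngram_tuples(s, 2)     # same pairs as zip(s, drop(s, 1))
fivegrams = ngram_tuples(s, 5)   # same as the five-argument zip in the question
```

Unlike the view-based version, this materializes one tuple per n-gram, so the memory advantage described above does not apply.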

Also note that using a generator lets you process the n-grams one by one, faster and with less memory, which might be enough for the desired processing code (counting something or passing them through some hash). For example, use @Gnimuc's solution without the collect, i.e. just partition(s, n, 1).
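The three-argument partition comes from an external iterator package, so here is a sketch of the same one-at-a-time idea using only Base. The names lazy_ngrams and count_ngrams are hypothetical, and counting n-gram frequencies is just one example of the kind of processing meant above:

```julia
# Hypothetical helper: a lazy variant of `ngram`. The parentheses create a
# generator, so views are produced one at a time instead of stored in an array.
lazy_ngrams(s, n) = (view(s, i:i+n-1) for i in 1:length(s)-n+1)

# Count n-gram frequencies without ever holding the full list of n-grams.
function count_ngrams(s, n)
    counts = Dict{String,Int}()
    for gram in lazy_ngrams(s, n)
        key = join(gram, " ")                  # plain String key, independent of `s`
        counts[key] = get(counts, key, 0) + 1
    end
    return counts
end

s = split("the cat saw the cat run")
counts = count_ngrams(s, 2)
```

Only the dictionary of counts grows with the input here; each view is discarded as soon as its key has been built.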

Tags: zip, nlp, julia, n-gram

alvas
