If I wanted to get all the tokens in the index that Elasticsearch creates (I'm using the Rails Elasticsearch gem), how would I go about doing that? Something like this only returns the tokens for a particular search term:
curl -XGET 'http://localhost:9200/development_test/_analyze?text=John%20Smith'
You can combine the Scroll API with the Term Vectors API to enumerate terms in the inverted index:
require "elastomer/client"
require "set"

client = Elastomer::Client.new({ :url => "http://localhost:9200" })
index = "someindex"
type = "sometype"
field = "somefield"
terms = Set.new

# Scroll over every document in the index.
client.scan(nil, :index => index, :type => type).each_document do |document|
  # Fetch the term vectors for this document, restricted to one field.
  term_vectors = client.index(index).docs(type).termvector({ :fields => field, :id => document["_id"] })["term_vectors"]
  if term_vectors.key?(field)
    term_vectors[field]["terms"].keys.each do |term|
      # Print each term only the first time it is seen.
      unless terms.include?(term)
        terms << term
        puts(term)
      end
    end
  end
end
This is rather slow and wasteful since it performs a _termvectors HTTP request for every single document in the index, holds all the terms in RAM, and keeps a scroll context open for the duration of the enumeration. However, it doesn't require another tool like Luke, and the terms can be streamed out of the index.
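The extract-and-deduplicate step can be tried without a running cluster. The following is only a sketch: `terms_in` and `each_new_term` are hypothetical helper names, and the sample hashes are hand-written to mimic the `"term_vectors"` shape the code above reads, not real Elasticsearch output.

```ruby
require "set"

# Hypothetical helper: pulls the terms for one field out of a single
# term vectors response (a Hash shaped like the one used above).
def terms_in(response, field)
  vectors = response.fetch("term_vectors", {})
  return [] unless vectors.key?(field)
  vectors[field].fetch("terms", {}).keys
end

# Hypothetical helper: yields each term the first time it is seen
# across many responses, and returns the Set of all distinct terms.
def each_new_term(responses, field)
  seen = Set.new
  responses.each do |response|
    terms_in(response, field).each do |term|
      # Set#add? returns nil when the term is already present.
      yield term if seen.add?(term)
    end
  end
  seen
end

# Hand-written sample responses (an assumption, for illustration only).
sample = [
  { "term_vectors" => { "somefield" => { "terms" => { "john" => {}, "smith" => {} } } } },
  { "term_vectors" => {} },
  { "term_vectors" => { "somefield" => { "terms" => { "smith" => {}, "jones" => {} } } } }
]

streamed = []
all_terms = each_new_term(sample, "somefield") { |term| streamed << term }
```

Because `each_new_term` accepts any object that responds to `each`, the responses could presumably be produced lazily inside the scroll loop, so each term is still streamed out as soon as it is first seen. The set of seen terms still has to stay in memory for deduplication.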