
analyzed or not_analyzed, what to choose

I'm using only Kibana to search Elasticsearch, and I have several fields that can only take a few values (worst case: servername, with 30 different values).

I do understand what analysis does to bigger, more complex fields, but for small and simple ones like these I fail to understand the advantages and disadvantages of analyzed/not_analyzed fields.

So what are the benefits of using analyzed or not_analyzed for a "limited set of values" field (for example, servername: server[0-9]*, with no special characters to break on)? What kinds of searches will I lose in Kibana? Will I gain any search speed or disk space?

Testing on one of them, I saw that the .raw version of the field is now empty, but Kibana still flags the field as analyzed, so I find my tests inconclusive.

higuita asked May 30 '16 19:05



1 Answer

I will try to keep it simple; if you need more clarification, just let me know and I'll elaborate.

An "analyzed" field is broken into tokens by the analyzer that you defined for that specific field in your mapping. If you are using the default analyzer (the standard analyzer, which roughly splits text on word boundaries and lowercases each token) on values without special characters, let's say server[1-9], it is going to tokenize:

this -> HelloWorld123
into -> token1:helloworld123

OR

this -> Hello World 123
into -> token1:hello && token2:world && token3:123

In this case, if you search for HeLlO, the query is analyzed too and becomes "hello", so it matches this document because the token "hello" is there.
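To make the tokenizing step concrete, here is a minimal Python sketch. It only approximates the default analyzer with a regular expression (split on anything that is not a letter or digit, then lowercase); the function name `analyze` is made up for illustration and is not an Elasticsearch API.

```python
import re


def analyze(text):
    """Rough stand-in for the default analyzer: split on non-alphanumerics, lowercase."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


# One unbroken word stays a single token.
print(analyze("HelloWorld123"))

# Whitespace separates the value into three tokens.
print(analyze("Hello World 123"))

# The query text goes through the same analysis, so "HeLlO" finds the "hello" token.
query_tokens = analyze("HeLlO")
print(query_tokens[0] in analyze("Hello World 123"))
```

The real analyzer handles Unicode and punctuation with more care than this regex, so treat it as a mental model only.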

In the case of not_analyzed fields, no tokenizer is applied at all; the whole value is kept as a single token. That being said:

this -> Hello World 123
into -> token1:(Hello World 123)

If you search that field for "hello world 123", it is not going to match, because the match is case sensitive (you can still use wildcards such as Hello*, but let's address that another time).
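The same idea as a small Python sketch: the stored token is the untouched value, so only an exact, case-sensitive comparison or a matching wildcard pattern succeeds. The helper names are hypothetical, and this models only the matching logic, not how Elasticsearch or Kibana parse a query.

```python
from fnmatch import fnmatchcase


def exact_match(query, stored_value):
    """A not_analyzed field keeps the value as one token, so compare it as-is."""
    return query == stored_value


def wildcard_match(pattern, stored_value):
    """Case-sensitive wildcard comparison against the single stored token."""
    return fnmatchcase(stored_value, pattern)


stored = "Hello World 123"

print(exact_match("hello world 123", stored))   # different case, no match
print(exact_match("Hello World 123", stored))   # identical value, match
print(wildcard_match("Hello*", stored))         # prefix wildcard, match
```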

In a nutshell:

Use "analyzed" fields for fields that you are going to search as full text and that you want Elasticsearch to score. Example: titles that contain the word "jobs", with the query "title:jobs".

doc1 : title:developer jobs in montreal
doc2 : title:java coder jobs in vancouver
doc3 : title:unix designer jobs in toronto
doc4 : title:database manager vacancies in montreal

This is going to retrieve doc1, doc2 and doc3.

In those cases, "analyzed" fields are what you want.
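As a rough illustration of why the "jobs" query finds those three documents, here is a toy inverted index in Python. It is a sketch under simple assumptions (regex tokenizing, no scoring), and `analyze`, `build_index` and `search` are illustrative names, not Elasticsearch functions.

```python
import re
from collections import defaultdict

titles = {
    "doc1": "developer jobs in montreal",
    "doc2": "java coder jobs in vancouver",
    "doc3": "unix designer jobs in toronto",
    "doc4": "database manager vacancies in montreal",
}


def analyze(text):
    """Approximate analysis: split on non-alphanumerics and lowercase."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


def build_index(docs):
    """Map every token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, title in docs.items():
        for token in analyze(title):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing any token of the analyzed query."""
    hits = set()
    for token in analyze(query):
        hits |= index.get(token, set())
    return sorted(hits)


index = build_index(titles)
print(search(index, "jobs"))
```

doc4 is left out because none of its tokens is "jobs". Relevance scoring, which is the other reason to pick an analyzed field, is not modelled here.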

If you know in advance what kind of data will be in that field and you are going to query for exact values, then "not_analyzed" is what you want.

example:

get all the logs from server123.

query:"server:server123".

doc1 :server:server123,log:randomstring,date:01-jan
doc2 :server:server986,log:randomstring,date:01-jan
doc3 :server:server777,log:randomstring,date:01-jan
doc4 :server:server666,log:randomstring,date:01-jan
doc5 :server:server123,log:randomstring,date:02-jan

This returns results only from doc1 and doc5.
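For reference, this is roughly what the mapping and an exact-match query for that example could look like, built as Python dictionaries and printed as JSON. It assumes an Elasticsearch version before 5.0, where string fields accept "index": "not_analyzed" (later versions use the keyword type instead). The type name "logs" is a placeholder I made up; the field names come from the example documents.

```python
import json

# Assumption: pre-5.0 mapping syntax; "logs" is a hypothetical type name.
mapping = {
    "mappings": {
        "logs": {
            "properties": {
                # Stored as one exact token, good for server names.
                "server": {"type": "string", "index": "not_analyzed"},
                # Left analyzed (the default) for free-text searching.
                "log": {"type": "string"},
            }
        }
    }
}

# A term query compares against the stored token without analyzing the input.
term_query = {"query": {"term": {"server": "server123"}}}

print(json.dumps(mapping, indent=2))
print(json.dumps(term_query, indent=2))
```

The first body would be sent when creating the index and the second to the search endpoint; how you send them depends on your client.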

Well, I hope you get the point. As I said, keep it simple: it is about what you need.

analyzed -> more space on disk (a LOT more if the analyzed fields are big).
analyzed -> more time for indexing.
analyzed -> better for matching documents.

not_analyzed -> less space on disk.
not_analyzed -> less time for indexing.
not_analyzed -> exact matches on the field, or wildcard searches.

Regards,

Daniel

Daniel Andres Acevedo answered Nov 14 '22 00:11
