
SASI index in Cassandra and How it differs from normal indexing

I started using SASI indexing and used the following setup,

CREATE TABLE employee (
    id int,
    lastname text,
    firstname text,
    dateofbirth date,
    PRIMARY KEY (id, lastname, firstname)
) WITH CLUSTERING ORDER BY (lastname ASC, firstname ASC);

CREATE CUSTOM INDEX employee_firstname_idx ON employee (firstname) USING 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {'mode': 'CONTAINS', 'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer', 'case_sensitive': 'false'};

I perform the following query,

SELECT * FROM employee WHERE firstname like '%s';

From what I have read, it seems to work the same as a normal secondary index in Cassandra, except that it supports LIKE searches.

1) Could somebody explain how it differs from a normal secondary index in Cassandra?
2) What are the best settings for options like mode, analyzer_class and case_sensitive? Is there any recommended documentation for this?

Asked by Harry, Feb 11 '18 18:02


2 Answers

1) Could somebody explain how it differs from a normal secondary index in Cassandra?

A normal secondary index is essentially another lookup table made up of the indexed columns and the primary key. It therefore has its own set of SSTable files (disk), its own memtable (memory), and its own write overhead (CPU).

SASI is an improved index implementation that Apple open sourced and contributed to the Cassandra community. A SASI index is created for every SSTable as it is flushed to disk, and no separate table is maintained. The result is less disk usage, no separate memtable, bloom filter, or partition index (less memory), and minimal overhead.
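As a rough illustration of the contrast, the sketch below adds a regular secondary index to the employee table from the question. The index name employee_dob_idx and the date literal are made up for the example, and the statements have not been run against a cluster.

```cql
-- Regular (non-SASI) secondary index; the index name is hypothetical.
CREATE INDEX employee_dob_idx ON employee (dateofbirth);

-- A regular secondary index serves equality lookups on the indexed column.
SELECT * FROM employee WHERE dateofbirth = '1990-05-17';

-- LIKE patterns, by contrast, need a SASI index such as the one
-- created on firstname in the question.
SELECT * FROM employee WHERE firstname LIKE '%s';
```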

2) What are the best settings for options like mode, analyzer_class and case_sensitive? Is there any recommended documentation for this?

The right configuration depends on your use case.

Essentially, there are three modes:

  1. PREFIX - Used to serve LIKE queries based on a prefix of the indexed column
  2. CONTAINS - Used to serve LIKE queries based on whether the search term occurs anywhere in the indexed column
  3. SPARSE - Used to index data that is sparse (every term/column value has fewer than 5 matching keys). For example, range queries that span large timestamps.
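A minimal sketch of the PREFIX and SPARSE modes, assuming a made-up table employee_login with a text column and a millisecond-timestamp column. The table, column, and index names are hypothetical, and the statements are untested, so treat them as a starting point.

```cql
-- Hypothetical table used only for this illustration.
CREATE TABLE IF NOT EXISTS employee_login (
    id int PRIMARY KEY,
    username text,
    created_at bigint
);

-- PREFIX is the default mode, so 'mode' could also be omitted here.
CREATE CUSTOM INDEX employee_login_username_idx ON employee_login (username)
    USING 'org.apache.cassandra.index.sasi.SASIIndex'
    WITH OPTIONS = {'mode': 'PREFIX'};

-- SPARSE suits numeric columns where each value matches only a few rows,
-- such as millisecond timestamps.
CREATE CUSTOM INDEX employee_login_created_idx ON employee_login (created_at)
    USING 'org.apache.cassandra.index.sasi.SASIIndex'
    WITH OPTIONS = {'mode': 'SPARSE'};

-- Prefix match served by the PREFIX index.
SELECT * FROM employee_login WHERE username LIKE 'har%';

-- Range over the SPARSE-indexed timestamp column.
SELECT * FROM employee_login
 WHERE created_at > 1518300000000 AND created_at < 1518400000000;
```

A suffix or substring pattern such as '%rry' would need a CONTAINS index like the one in the question instead.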

analyzer_class: An analyzer can be specified to analyze the text in the indexed column.

  1. The NonTokenizingAnalyzer is used for cases where the text is not analyzed, but case normalization or sensitivity is required.
  2. The StandardAnalyzer is used for analysis that involves stemming, case normalization, case sensitivity, skipping common words like "and" and "the", and localization of the language used to complete the analysis.
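For the StandardAnalyzer, a sketch along the lines of the example in the SASI documentation that ships with Cassandra might look like this. The table employee_bio and the index name are hypothetical. The option names are the ones that documentation uses, but check them against your Cassandra version.

```cql
-- Hypothetical table used only for this illustration.
CREATE TABLE IF NOT EXISTS employee_bio (
    id int PRIMARY KEY,
    bio text
);

-- Tokenize the bio text, lowercase it, and apply English stemming.
CREATE CUSTOM INDEX employee_bio_idx ON employee_bio (bio)
    USING 'org.apache.cassandra.index.sasi.SASIIndex'
    WITH OPTIONS = {
        'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
        'analyzed': 'true',
        'tokenization_enable_stemming': 'true',
        'tokenization_normalize_lowercase': 'true',
        'tokenization_locale': 'en'
    };

-- With stemming enabled, a search term like this is expected to also
-- match rows whose bio contains a related form such as 'distributed'.
SELECT * FROM employee_bio WHERE bio LIKE 'distributing';
```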

case_sensitive: As the name implies, controls whether searches against the indexed column are case sensitive. Applicable values are:

  1. true
  2. false
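A small sketch of a case-insensitive index using the NonTokenizingAnalyzer, assuming a made-up table employee_name. The table and index names are hypothetical, and the statements have not been run.

```cql
-- Hypothetical table used only for this illustration.
CREATE TABLE IF NOT EXISTS employee_name (
    id int PRIMARY KEY,
    firstname text
);

-- No tokenization; values are only normalized so that case is ignored.
CREATE CUSTOM INDEX employee_name_firstname_idx ON employee_name (firstname)
    USING 'org.apache.cassandra.index.sasi.SASIIndex'
    WITH OPTIONS = {
        'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer',
        'case_sensitive': 'false'
    };

-- With case_sensitive set to 'false', both patterns are expected to
-- match a stored value such as 'Harry'.
SELECT * FROM employee_name WHERE firstname LIKE 'har%';
SELECT * FROM employee_name WHERE firstname LIKE 'HAR%';
```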

Detailed documentation reference here and detailed blog post on performance.

Answered by dilsingi, Oct 21 '22 03:10


Here is a short summary of SASI from https://github.com/scylladb/scylla/wiki/Indexing-in-Cassandra-3:

SASI (acronym of "SSTable-Attached Secondary Indexing") is a reimplementation of the classic Cassandra secondary indexing with one main goal in mind: to efficiently support more sophisticated search queries such as:

  • AND or OR combinations of queries.
  • Wildcard search in string values.
  • Range queries.
  • Lucene-inspired word search in string values (including word breaking, capitalization normalization, stemming, etc., as determined by a user-given "Analyzer").

Some of these things were already possible with a secondary index, but inefficiently, because they required getting a long list of partitions, reading them (with an inefficient seek to each one), and filtering on them. SASI implements them using a new on-disk format based on B+ trees, and does not reuse regular Cassandra column families or sstables like the classic secondary indexing method did.

SASI attaches to each sstable its own immutable index file (and hence the name of this method), and also attaches an index to each memtable. During compaction, the indexes of the files being compacted together are also compacted to create one new index.
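To make the combined queries from the list above concrete, here is a hedged sketch that pairs a wildcard search with a range predicate. The table employee_stats and the index names are hypothetical, and the statements are untested. CQL itself has no OR operator, so only the AND form is shown.

```cql
-- Hypothetical table used only for this illustration.
CREATE TABLE IF NOT EXISTS employee_stats (
    id int PRIMARY KEY,
    firstname text,
    age int
);

-- CONTAINS index for substring searches on firstname.
CREATE CUSTOM INDEX employee_stats_firstname_idx ON employee_stats (firstname)
    USING 'org.apache.cassandra.index.sasi.SASIIndex'
    WITH OPTIONS = {'mode': 'CONTAINS'};

-- Default (PREFIX) mode index on a numeric column, usable for ranges.
CREATE CUSTOM INDEX employee_stats_age_idx ON employee_stats (age)
    USING 'org.apache.cassandra.index.sasi.SASIIndex';

-- Wildcard search combined with a range predicate. CQL asks for
-- ALLOW FILTERING when several indexed predicates are combined.
SELECT * FROM employee_stats
 WHERE firstname LIKE '%ar%' AND age < 30
 ALLOW FILTERING;
```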

Answered by Nadav Har'El, Oct 21 '22 03:10