
iOS app compiles SQLite FTS with ICU, but searching for a single letter like "z" returns no results

Tags: sqlite, ios, icu

In SQLite I:

  1. Create a virtual FTS table MyTable (tokenize=icu, id text, subject text, abstract text)
  2. Successfully insert into MyTable (id, subject, abstract) values (?, ?, ?), so I have the row: 今天天气不错fmowomrogmeog, wfomgomrg, 我是谁erz

When I run select id from MyTable where MyTable match 'z*', it does not return anything. Whenever I search for that single letter, it returns nothing. However, if I search for 'm', '天气' or '天', it works.

I know SQLite FTS only supports prefix matching, so I am using ICU. Am I making a mistake?

Note: I've looked at the source code on Foxmail, and it looks to me like I can search for ',', 'f' and so on.

user1243169 asked Aug 22 '13 16:08


2 Answers

Try Hai Feng Kao's character tokenizer. It can match a prefix, a suffix, or anything in between, and it supports Chinese as well. I don't think you can find any other tokenizer that supports arbitrary substring search.

BTW, this is shameless self-promotion.
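To illustrate why one token per character allows arbitrary substring search, here is a minimal sketch in Python. It is not the character tokenizer's actual implementation; it is an in-memory model of the idea, and the names build_index and search_substring are made up for this example. Every character becomes its own token with a position, and a substring query is answered as a run of tokens at consecutive positions, much like a phrase query.

```python
from collections import defaultdict


def build_index(docs):
    """Map each character to {doc_id: [positions]} (one token per character)."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, ch in enumerate(text):
            index[ch][doc_id].append(pos)
    return index


def search_substring(index, query):
    """Return ids of documents where the query characters occur consecutively."""
    if not query:
        return set()
    hits = set()
    # Candidate start positions come from the first character of the query.
    for doc_id, starts in index.get(query[0], {}).items():
        for start in starts:
            # Every following character must sit at the next position.
            if all(
                start + offset in index.get(ch, {}).get(doc_id, ())
                for offset, ch in enumerate(query)
            ):
                hits.add(doc_id)
                break
    return hits


# Sample data taken from the question (hypothetical document ids).
docs = {1: "今天天气不错fmowomrogmeog", 2: "我是谁erz"}
index = build_index(docs)

print(search_substring(index, "z"))     # suffix of "erz"
print(search_substring(index, "omro"))  # middle of "fmowomrogmeog"
print(search_substring(index, "天气"))
```

The cost of this design is a larger index, since every character gets its own posting list, which is the usual trade-off for substring search.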

If you want to open a database indexed with the character tokenizer in Objective-C, do the following:

#import <FMDB/FMDatabase.h>
#import "character_tokenizer.h"

FMDatabase* database = [[FMDatabase alloc] initWithPath:@"my_database.db"];
if ([database open]) {
    // add FTS support
    const sqlite3_tokenizer_module *ptr;
    get_character_tokenizer_module(&ptr);
    registerTokenizer(database.sqliteHandle, "character", ptr);
}
Hai Feng Kao answered Oct 17 '22 11:10


You may also try FMDB's FMSimpleTokenizer. FMSimpleTokenizer uses the built-in CFStringTokenizer, and according to Apple's documentation, "CFStringTokenizer allows you to tokenize strings into words, sentences or paragraphs in a language-neutral way. It supports languages such as Japanese and Chinese that do not delimit words by spaces".

If you check the FMSimpleTokenizer code, you will find that the tokenization is done by calling CFStringTokenizerAdvanceToNextToken and CFStringTokenizerGetCurrentTokenRange.

One interesting "fact" is how CFStringTokenizer tokenizes Chinese words. For example, "欢迎使用" will be tokenized into "欢迎" and "使用", which totally makes sense, but if you search for "迎", you will be surprised to see no result at all!

In that case, you probably need to write a tokenizer like Hai Feng Kao's SQLite tokenizer.
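The "迎" case and the 'z*' case from the question seem to share one cause: MATCH compares the query against whole tokens or token prefixes, never against the inside of a token. The following sketch shows this with Python's sqlite3 module. It assumes the SQLite library behind Python was built with FTS3, and it uses the default simple tokenizer on ASCII text rather than ICU, since ICU is often not compiled in. The sample row and the match helper are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Default "simple" tokenizer; assumption: FTS3 is available in this SQLite build.
conn.execute("CREATE VIRTUAL TABLE MyTable USING fts3(id, subject, abstract)")
conn.execute(
    "INSERT INTO MyTable (id, subject, abstract) VALUES (?, ?, ?)",
    ("1", "wfomgomrg", "hello erz"),
)


def match(query):
    """Return the ids of rows whose tokens satisfy the FTS query."""
    rows = conn.execute(
        "SELECT id FROM MyTable WHERE MyTable MATCH ?", (query,)
    ).fetchall()
    return [row[0] for row in rows]


print(match("erz"))  # whole token
print(match("e*"))   # prefix of the token "erz"
print(match("z*"))   # no token starts with "z"
print(match("rz"))   # "rz" is only the tail of a token
```

With word-level tokens, only the first two queries can find the row, which is why a tokenizer that emits one token per character is needed for suffix or infix matches.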

Qiulang 邱朗 answered Oct 17 '22 09:10