Scroll to the end to skip the explanation.
In my Android app, I want to use non-English Unicode text strings to search for matches in text documents/fields that are stored in a SQLite database. I've learned (so I thought) that what I need to do is implement Full Text Search with fts3/fts4, so that is what I have been working on learning for the past couple of days. FTS is supported by Android, as is shown in the documentation Storing and Searching for Data and in the blog post Android Quick Tip: Using SQLite FTS Tables.
Everything was looking good, but then I read the March 2012 blog post The sorry state of SQLite full text search on Android, which said:
The first step when building a full text search index is to break down the textual content into words, aka tokens. Those tokens are then entered into a special index which lets SQLite perform very fast searches based on a token (or a set of tokens).
SQLite has two built-in tokenizers, and they both only consider tokens consisting of US ASCII characters. All other, non-US ASCII characters are considered whitespace.
After that I also found this StackOverflow answer by @CL. (who, based on tags and reputation, appears to be an expert on SQLite) replying to a question about matching Vietnamese letters with different diacritics:
You must create the FTS table with a tokenizer that can handle Unicode characters, i.e., ICU or UNICODE61.
Please note that these tokenizers might not be available on all Android versions, and that the Android API does not expose any functions for adding user-defined tokenizers.
This 2011 SO answer seems to confirm that Android does not support tokenizers beyond the two basic ones, `simple` and `porter`.
This is 2015. Are there any updates to this situation? I need full text search to work for everyone using my app, not just people with new phones (even if the newest Android version does support it now).
I find it hard to believe that FTS does not work at all with Unicode. The documentation for the `simple` tokenizer says:
A term is a contiguous sequence of eligible characters, where eligible characters are all alphanumeric characters and all characters with Unicode codepoint values greater than or equal to 128. All other characters are discarded when splitting a document into terms. Their only contribution is to separate adjacent terms. (emphasis added)
That gives me hope that some basic Unicode functionality could still be supported in Android, even if things like capitalization and diacritics (and various other equivalent letter forms that have different Unicode code points) are not supported.
Can I use SQLite FTS in Android with non-English Unicode text (codepoints > 128) if I am only using literal Unicode string tokens separated by spaces? (That is, I am searching for exact strings that occur in the text.)
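To make that concrete, here is a minimal sketch of the kind of query I mean (the `notes` table and its `body` column are hypothetical names, not anything from a real API):

```java
import android.database.Cursor;
import android.database.sqlite.SQLiteDatabase;

public class NoteSearch {

    // Hypothetical schema: an FTS table "notes" with a single "body" column.
    // The search term is a literal Unicode token, e.g. "héllo" or "こんにちは",
    // and the goal is an exact (not case- or diacritic-folded) match.
    public static Cursor findExact(SQLiteDatabase db, String unicodeToken) {
        return db.rawQuery(
                "SELECT rowid, body FROM notes WHERE body MATCH ?",
                new String[]{ unicodeToken });
    }
}
```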
Supplemental Answer
I ended up doing what @CL recommended and was able to successfully implement Full Text Search with Unicode. These are the basic steps I followed:
1. Scrub the text of the documents (and of the search terms) before inserting them into the FTS table, so that, for example, `ē`, `è`, and `é` could all be replaced with `e` (if this sort of generalized search is desired). This is not necessary, but if you don't do this, then searching for `é` will only return documents with `é`, and searching for `e` will only return documents with `e` (and not `é`).
2. Read Full text search example in Android for instructions on how to create the FTS table and link it to the normal table (a rough sketch follows this list).

This took a long time to figure out, but in the end it made for very fast full text searches, even for a very large number of documents.
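As an illustration of those steps, here is a minimal sketch of one way to set up and query the linked tables on Android (all table and column names are made up for this example; the scrubbed text is whatever your chosen normalization produces):

```java
import android.content.ContentValues;
import android.database.Cursor;
import android.database.sqlite.SQLiteDatabase;

public class FtsHelper {

    // A normal table for the original documents, plus an FTS4 table
    // holding the scrubbed (normalized) text, linked by docid = _id.
    public static void createTables(SQLiteDatabase db) {
        db.execSQL("CREATE TABLE documents (_id INTEGER PRIMARY KEY, body TEXT)");
        db.execSQL("CREATE VIRTUAL TABLE documents_fts USING fts4 (body_scrubbed)");
    }

    // Store the original text, and index the scrubbed copy.
    public static long insertDocument(SQLiteDatabase db,
                                      String body, String scrubbedBody) {
        ContentValues doc = new ContentValues();
        doc.put("body", body);
        long id = db.insert("documents", null, doc);

        ContentValues fts = new ContentValues();
        fts.put("docid", id); // links the FTS row to the normal row
        fts.put("body_scrubbed", scrubbedBody);
        db.insert("documents_fts", null, fts);
        return id;
    }

    // Search the FTS table with a scrubbed query, then join back to the
    // normal table to get the original (unscrubbed) text for display.
    public static Cursor search(SQLiteDatabase db, String scrubbedQuery) {
        return db.rawQuery(
                "SELECT d._id, d.body FROM documents d "
                        + "JOIN documents_fts f ON f.docid = d._id "
                        + "WHERE f.body_scrubbed MATCH ?",
                new String[]{ scrubbedQuery });
    }
}
```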
If you need more details please leave a comment below.
Answer (by @CL)

Unicode characters are handled like 'normal' letters, so you can use them in FTS data and search terms. (Prefix searches should work, too.)
The problem is that Unicode characters are not normalized: all characters are treated as letters (even if they are actually punctuation (―†) or other non-letter characters (☺♫)), upper/lowercase are not merged, and diacritics are not removed.
If you want to handle those cases correctly, you have to do these normalizations manually before you insert the documents into the database, and before you use the search terms.
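As one illustration of such a manual normalization (only a sketch; `java.text.Normalizer` is available on Android, but exactly which characters you fold together depends on your language):

```java
import java.text.Normalizer;
import java.util.Locale;

public class Scrubber {

    // One possible normalization: lowercase the text, decompose accented
    // characters (NFD), then strip the combining diacritical marks, so
    // that "ē", "è", and "é" all become "e". Run this over document text
    // before inserting it, and over every search term before querying.
    public static String scrub(String text) {
        String lowered = text.toLowerCase(Locale.ROOT);
        String decomposed = Normalizer.normalize(lowered, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }
}
```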