Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why does CONTAINS find inequal text strings in JCR-SQL2?

Tags:

jcr

jcr-sql2

Working with a JCR-SQL2 query I noticed that the CONTAINS operator finds nodes which do not have exactly the same string that was in the condition.

Example

The following query:

SELECT * FROM [nt:base] AS s WHERE CONTAINS(s.*, 'my/search-expression')

would not find only nodes that contain the my/search-expression string, but also nodes with strings like my/another/search/expression.

Why does the query not find only the exact string provided? How could it be changed to narrow down the results?

This question is intended to be answered by myself, for knowledge sharing - but feel free to add your own answer or improve an existing one.

like image 892
pkalinow Avatar asked Mar 08 '23 05:03

pkalinow


2 Answers

An execution plan for the example query reveals the root cause of the problem:

  [nt:base] as [s] /* lucene:lucene(/oak:index/lucene) +:fulltext:my +:fulltext:search +:fulltext:expression ft:("my/search-expression") where contains([s].[*], 'my/search-expression') */

The CONTAINS operator triggers a full text search. Non-word characters, like "/" or "-", are used as word delimiters. As a result, the query looks for all nodes that contain the words: "my", "search" and "expression".

What can be done with it? There are several options.

1. Use double quotes

If you want to limit results to phrases with given words in exact order and without any other words between them, put the search expression inside double quotes:

SELECT * FROM [nt:base] AS s WHERE CONTAINS(s.*, '"my/search-expression"')

Now, the execution plan is different:

[nt:base] as [s] /* lucene:lucene(/oak:index/lucene) :fulltext:"my search expression" ft:("my/search-expression") where contains([s].[*], '"my/search-expression"') */

The query will now look for the whole phrase, not single words. However, it still ignores non-word characters, so such phrases would also be found: "my search expression" or "my-search-expression".

2. Use LIKE expression (not recommended)

If you want to find only the exact phrase, keeping non-word characters, you can use the LIKE expression:

SELECT * FROM [nt:base] AS s WHERE s.* LIKE '%my/search-expression%'

This is, however, much slower. I needed to add another condition to avoid timeout during explaining the execution plan. For this query:

SELECT * FROM [nt:base] AS s WHERE s.* LIKE '%my/search-expression%'  AND ISDESCENDANTNODE([/content/my/content]) 

the execution plan is:

[nt:base] as [s] /* traverse "/content/my/content//*" where ([s].[*] like '%my/search-expression%') and (isdescendantnode([s], [/content/my/content])) */

It would find only nodes with this phrase: "my/search-expression".

3. Use double quotes and refine the results

It would be probably better to use the first approach (CONTAINS with double quotes) and refine the results later, for example in application code if the query is run from an application.

4. Mix CONTAINS and LIKE

Another option is to mix full-text search and LIKE expression with AND:

SELECT * FROM [nt:base] AS s WHERE CONTAINS(s.*, '"my/search-expression"') AND s.* LIKE '%my/search-expression%'

The execution plan is now:

[nt:base] as [s] /* lucene:lucene(/oak:index/lucene) :fulltext:"my search expression" ft:("my/search-expression") where (contains([s].[*], '"my/search-expression"')) and ([s].[*] like '%my/search-expression%') */

Now, it should be fast and strict in the same time.

like image 195
pkalinow Avatar answered Mar 09 '23 17:03

pkalinow


Had the same problem.

So basically you should define different tokenizer for your lucene index, in my case "Whitespace" tokenizer was just fine.

With Standard tokenizer "my/search-expression" is splitted in 3 tokens "my", "search", "expression". Standard tokenizer use some special characters as delimiter.

Thats the reason why for "my/search-expression" you get 0 results.

Another example:

"some-other my search/expression" with Whitespace tokenizer this is splitted into:

"some-other", "my", "search/expression"

When you search for "some-other my" this should return results.

List of tokenizers

Lucene index example:

<yourLucene
jcr:primaryType="oak:QueryIndexDefinition"
type="lucene"
async="async"
evaluatePathRestrictions="{Boolean}true"
includedPaths="[/somepath]"
queryPaths="[/somepath]"
compatVersion="{Long}2">
<analyzers jcr:primaryType="nt:unstructured">
    <default jcr:primaryType="nt:unstructured">
        <tokenizer
            jcr:primaryType="nt:unstructured"
            name="Whitespace"/>
        <filters jcr:primaryType="nt:unstructured">
            <Standard jcr:primaryType="nt:unstructured"/>
            <LowerCase jcr:primaryType="nt:unstructured"/>
            <Stop jcr:primaryType="nt:unstructured"/>
        </filters>
    </default>
</analyzers>
<indexRules jcr:primaryType="nt:unstructured">
    <nt:unstructured jcr:primaryType="nt:unstructured">
        <properties jcr:primaryType="nt:unstructured">
            <someprop
                jcr:primaryType="nt:unstructured"
                name="someprop"
                propertyIndex="{Boolean}true"
                type="String"/>
        </properties>
    </nt:unstructured>
</indexRules>
like image 25
mkovacek Avatar answered Mar 09 '23 17:03

mkovacek