Working with a JCR-SQL2 query I noticed that the CONTAINS
operator finds nodes
which do not have exactly the same string that was in the condition.
Example
The following query:
SELECT * FROM [nt:base] AS s WHERE CONTAINS(s.*, 'my/search-expression')
would not find only nodes that contain the my/search-expression
string, but also nodes with strings like my/another/search/expression
.
Why does the query not find only the exact string provided? How could it be changed to narrow down the results?
This question is intended to be answered by myself, for knowledge sharing - but feel free to add your own answer or improve an existing one.
An execution plan for the example query reveals the root cause of the problem:
[nt:base] as [s] /* lucene:lucene(/oak:index/lucene) +:fulltext:my +:fulltext:search +:fulltext:expression ft:("my/search-expression") where contains([s].[*], 'my/search-expression') */
The CONTAINS
operator triggers a full text search. Non-word characters, like "/" or "-", are used as word delimiters. As a result, the query looks for all nodes that contain the words: "my", "search" and "expression".
What can be done with it? There are several options.
If you want to limit results to phrases with given words in exact order and without any other words between them, put the search expression inside double quotes:
SELECT * FROM [nt:base] AS s WHERE CONTAINS(s.*, '"my/search-expression"')
Now, the execution plan is different:
[nt:base] as [s] /* lucene:lucene(/oak:index/lucene) :fulltext:"my search expression" ft:("my/search-expression") where contains([s].[*], '"my/search-expression"') */
The query will now look for the whole phrase, not single words. However, it still ignores non-word characters, so such phrases would also be found: "my search expression" or "my-search-expression".
If you want to find only the exact phrase, keeping non-word characters, you can use the LIKE
expression:
SELECT * FROM [nt:base] AS s WHERE s.* LIKE '%my/search-expression%'
This is, however, much slower. I needed to add another condition to avoid timeout during explaining the execution plan. For this query:
SELECT * FROM [nt:base] AS s WHERE s.* LIKE '%my/search-expression%' AND ISDESCENDANTNODE([/content/my/content])
the execution plan is:
[nt:base] as [s] /* traverse "/content/my/content//*" where ([s].[*] like '%my/search-expression%') and (isdescendantnode([s], [/content/my/content])) */
It would find only nodes with this phrase: "my/search-expression".
It would be probably better to use the first approach (CONTAINS
with double quotes) and refine the results later, for example in application code if the query is run from an application.
Another option is to mix full-text search and LIKE
expression with AND
:
SELECT * FROM [nt:base] AS s WHERE CONTAINS(s.*, '"my/search-expression"') AND s.* LIKE '%my/search-expression%'
The execution plan is now:
[nt:base] as [s] /* lucene:lucene(/oak:index/lucene) :fulltext:"my search expression" ft:("my/search-expression") where (contains([s].[*], '"my/search-expression"')) and ([s].[*] like '%my/search-expression%') */
Now, it should be fast and strict in the same time.
Had the same problem.
So basically you should define different tokenizer for your lucene index, in my case "Whitespace" tokenizer was just fine.
With Standard tokenizer "my/search-expression" is splitted in 3 tokens "my", "search", "expression". Standard tokenizer use some special characters as delimiter.
Thats the reason why for "my/search-expression" you get 0 results.
Another example:
"some-other my search/expression" with Whitespace tokenizer this is splitted into:
"some-other", "my", "search/expression"
When you search for "some-other my" this should return results.
List of tokenizers
Lucene index example:
<yourLucene
jcr:primaryType="oak:QueryIndexDefinition"
type="lucene"
async="async"
evaluatePathRestrictions="{Boolean}true"
includedPaths="[/somepath]"
queryPaths="[/somepath]"
compatVersion="{Long}2">
<analyzers jcr:primaryType="nt:unstructured">
<default jcr:primaryType="nt:unstructured">
<tokenizer
jcr:primaryType="nt:unstructured"
name="Whitespace"/>
<filters jcr:primaryType="nt:unstructured">
<Standard jcr:primaryType="nt:unstructured"/>
<LowerCase jcr:primaryType="nt:unstructured"/>
<Stop jcr:primaryType="nt:unstructured"/>
</filters>
</default>
</analyzers>
<indexRules jcr:primaryType="nt:unstructured">
<nt:unstructured jcr:primaryType="nt:unstructured">
<properties jcr:primaryType="nt:unstructured">
<someprop
jcr:primaryType="nt:unstructured"
name="someprop"
propertyIndex="{Boolean}true"
type="String"/>
</properties>
</nt:unstructured>
</indexRules>
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With