Lets have a rather simple TDE schema:
<tde:template xmlns:tde='http://marklogic.com/xdmp/tde'>
<tde:description>testing a TDE view</tde:description>
<tde:context>/test</tde:context>
<tde:enabled>true</tde:enabled>
<tde:rows>
<tde:row>
<tde:schema-name>sitefusion</tde:schema-name>
<tde:view-name>test</tde:view-name>
<tde:columns>
<tde:column>
<tde:name>name</tde:name>
<tde:scalar-type>string</tde:scalar-type>
<tde:val>name</tde:val>
<tde:nullable>false</tde:nullable>
<tde:collation>http://marklogic.com/collation/en/S1</tde:collation>
</tde:column>
</tde:columns>
</tde:row>
</tde:rows>
</tde:template>
This schema has a collation http://marklogic.com/collation/en/S1
meaning it is case and diacritic insensitive
. With that in mind, i inserted a test document with a first char uppercase name Test
:
let $doc :=
<test>
<name>Test</name>
</test>
return xdmp:document-insert("/test/1.xml", $doc);
let $data := xdmp:sql("select name from test")
return json:array-pop($data)[2];
The output of that xquery is the expected value of name Test
.
Now lets update the name
and put a lowercase test
in it:
let $doc :=
<test>
<name>test</name>
</test>
return xdmp:document-insert("/test/1.xml", $doc);
let $data := xdmp:sql("select name from test")
return (
json:array-pop($data)[2],
doc("/test/1.xml")
)
This, for some reason does now also output Test
with a uppercase T
.
Why is this the case? The output of doc("/test/1.xml")
does show the correct document:
<test>
<name>test</name>
</test>
Is this the correct behaviour? Does collation not only affect comparing (like in a where clause), but also ingestion? Is there a way to have case-sensitive data and case-insensitive comparing ?
Edit:
Using MarkLogic 9.0-7.2
.
This is happening because the triple index stores unique values in a value dictionary for compression. Since "test" and "Test" are considered equal under the collation only one of the values will be stored in the value dictionary - in this case "Test".
By using a case and diacritic insensitive collation you are effectively saying that you don't care about case and diacritic differences. That means that you shouldn't care about the cases or diacritics used in the values returned from that column.
If you have a use case for retrieving the exact string as it appears in the original document, then add two columns - one that is case and diacritic sensitive, and one that uses the default codepoint collation. Search using the former, but return results from the latter.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With