Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

MarkLogic sql S1 collation showing wrong values

Lets have a rather simple TDE schema:

<tde:template xmlns:tde='http://marklogic.com/xdmp/tde'>
  <tde:description>testing a TDE view</tde:description>
  <tde:context>/test</tde:context>
  <tde:enabled>true</tde:enabled>
  <tde:rows>
    <tde:row>
      <tde:schema-name>sitefusion</tde:schema-name>
      <tde:view-name>test</tde:view-name>
      <tde:columns>
        <tde:column>
          <tde:name>name</tde:name>
          <tde:scalar-type>string</tde:scalar-type>
          <tde:val>name</tde:val>
          <tde:nullable>false</tde:nullable>
          <tde:collation>http://marklogic.com/collation/en/S1</tde:collation>
        </tde:column>
      </tde:columns>
    </tde:row>
  </tde:rows>
</tde:template>

This schema has a collation http://marklogic.com/collation/en/S1 meaning it is case and diacritic insensitive. With that in mind, i inserted a test document with a first char uppercase name Test:

let $doc := 
<test>
  <name>Test</name>
</test>
return xdmp:document-insert("/test/1.xml", $doc);

let $data := xdmp:sql("select name from test")

return json:array-pop($data)[2];

The output of that xquery is the expected value of name Test.

Now lets update the name and put a lowercase test in it:

let $doc := 
<test>
  <name>test</name>
</test>
return xdmp:document-insert("/test/1.xml", $doc);

let $data := xdmp:sql("select name from test")

return (
   json:array-pop($data)[2], 
   doc("/test/1.xml")
)

This, for some reason does now also output Test with a uppercase T. Why is this the case? The output of doc("/test/1.xml") does show the correct document:

<test>
      <name>test</name>
</test>

Is this the correct behaviour? Does collation not only affect comparing (like in a where clause), but also ingestion? Is there a way to have case-sensitive data and case-insensitive comparing ?

Edit: Using MarkLogic 9.0-7.2.

like image 247
Wagner Michael Avatar asked Mar 06 '23 00:03

Wagner Michael


1 Answers

This is happening because the triple index stores unique values in a value dictionary for compression. Since "test" and "Test" are considered equal under the collation only one of the values will be stored in the value dictionary - in this case "Test".

By using a case and diacritic insensitive collation you are effectively saying that you don't care about case and diacritic differences. That means that you shouldn't care about the cases or diacritics used in the values returned from that column.

If you have a use case for retrieving the exact string as it appears in the original document, then add two columns - one that is case and diacritic sensitive, and one that uses the default codepoint collation. Search using the former, but return results from the latter.

like image 155
John Snelson Avatar answered May 09 '23 21:05

John Snelson