
Regular Expressions in MarkLogic's XQuery

I am trying to run an XQuery that uses fn:matches with a regular expression, but the MarkLogic implementation of XQuery does not seem to allow hexadecimal character representations. The following gives me an "Invalid regular expression" error.

(: Find text containing non-ISO-Latin characters :)
let $regex := '[^\x00-\xFF]'
let $results := fn:collection('mydocs')//myns:myelem[fn:matches(., $regex)]
let $count := fn:count($results)

return
    <figures count="{$count}">
        { $results }
    </figures>

However, this one does not give the error.

let $regex := '[^a-zA-Z0-9]'
let $results := fn:collection('mydocs')//myns:myelem[fn:matches(., $regex)]
let $count := fn:count($results)

return
    <figures count="{$count}">
        { $results }
    </figures>

Is there a way to use the hexadecimal character representation, or an alternative that would give me the same result, in MarkLogic's implementation of XQuery?

kalinma asked May 01 '15 18:05



2 Answers

XQuery can use numeric character references in strings, in much the same way that XML and HTML can:

decimal: "&#10;"
hex: "&#x0a;" (or just "&#xa;")
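
As a quick check of that notation, the sketch below compares the decimal and hexadecimal forms of the same character. It relies only on the standard fn:string-to-codepoints function and is not specific to MarkLogic.

```xquery
(: Decimal and hex references to the same code point yield identical strings. :)
(
  "&#10;" eq "&#x0a;",                 (: true: both are U+000A, line feed :)
  "&#x0a;" eq "&#xa;",                 (: true: leading zeros are optional :)
  fn:string-to-codepoints("&#xFF;")    (: 255 :)
)
```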

However, some characters can't be represented this way: anything <= "&#x09;", for instance.

There's no regex type in XQuery (you just use a string as a regex), so you can use character references in your regular expressions:

fn:matches("a", "[^&#x09;-&#xFF;]")

(: => xs:boolean("false") :)
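
Applied to the question, the same idea might look like the sketch below. The sample strings are made up for illustration and stand in for the myns:myelem values. The lower bound is "&#x09;", as in the example above, because "&#x00;" is not a legal character reference. The second pattern is an alternative built on the Unicode block escapes from the XML Schema regular expression grammar; that MarkLogic accepts it is an assumption, not something tested here.

```xquery
(: Hypothetical sample strings; in the question these would be myns:myelem values. :)
let $samples := ("plain ASCII", "caf&#xE9;", "5 &#x20AC;")

(: Range written with character references instead of \xNN escapes. :)
let $by-range := '[^&#x09;-&#xFF;]'

(: Assumed alternative: Unicode block escapes covering U+0000 to U+00FF. :)
let $by-block := '[^\p{IsBasicLatin}\p{IsLatin-1Supplement}]'

for $s in $samples
return
    <sample text="{$s}"
            by-range="{fn:matches($s, $by-range)}"
            by-block="{fn:matches($s, $by-block)}"/>
```

Only the last sample contains a character above U+00FF (the euro sign, U+20AC), so it is the only one that should report true.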

Update: here's the XQuery 1.0 spec on character references: http://www.w3.org/TR/xquery/#dt-character-reference.

Based on some brief testing, I think MarkLogic enforces XML 1.1 character reference rules: http://www.w3.org/TR/xml11/#charsets

For posterity, here are the XML 1.0 rules: http://www.w3.org/TR/REC-xml/#charsets

joemfb answered Oct 05 '22 20:10



Well, it seems MarkLogic's implementation of XQuery wants Unicode. As it turned out, even very small ranges in hex (e.g., [^\x00-\x0F]) threw the "Invalid regular expression" error, but Unicode notation did not. The following gives me results.

let $regex := '[^U0000-U00FF]'
let $results := fn:collection('mydocs')//myns:myelem[fn:matches(., $regex)]
let $count := fn:count($results)

return
    <figures count="{$count}">
        { $results }
    </figures>

I think the bare assignment let $regex := '[^\x00-\xFF]' did not throw the error because the value is just a string at that point: when I tried return $regex, it was never compiled as a regular expression.
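
One caveat worth checking: read literally, with no backslashes, '[^U0000-U00FF]' is an ordinary negated character class made of the literal characters U, 0 and F plus the range 0-U, rather than a range of code points. The sketch below, using made-up sample strings, shows what that class matches under the standard XQuery regular expression rules.

```xquery
(: The pattern exactly as written above, with no backslashes. :)
let $regex := '[^U0000-U00FF]'
return (
  fn:matches("ABC123", $regex),     (: false: every character lies in 0-U :)
  fn:matches("abc", $regex),        (: true: lowercase ASCII is outside 0-U :)
  fn:matches("caf&#xE9;", $regex)   (: true :)
)
```

So the query may return results without actually isolating non-ISO-Latin text, since any element containing a lowercase letter would also match.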

kalinma answered Oct 05 '22 21:10
