How to escape "<" character in regex in Solr schema.xml?

Tags: java, regex, solr

I'm using Solr 3.5.0. In my schema.xml I use the following charFilter to mark the end of sentences and replace the end punctuation with a symbolic token:

<charFilter class="solr.PatternReplaceCharFilterFactory" 
pattern="(?<=[^.!?\\s][^.!?]*(?:[.!?](?![']?\s|$)[^.!?]*)*)[.!?]+(?=\\s|$)"
replacement=" monkeysentence"/>

I'm not sure if that will even work for what I want, but first I need to solve the problem of escaping the '<' character in the first '?<=' lookbehind.

I get the following error:

org.xml.sax.SAXParseException: The value of attribute "pattern" 
associated with an element type "null" must not contain the '<' character.

I've tried using a '\' as in:

 pattern="(?\<=[^.!?\\s][^.!?]*(?:[.!?](?![']?\s|$)[^.!?]*)*)[.!?]+(?=\\s|$)"

But I get the same error.

OdieO asked Apr 19 '12 02:04



1 Answer

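Because this is an XML file, the < has to be encoded with an XML escape, namely &lt;. A backslash is a regex escape, not an XML one, so the parser still sees a raw < in \< and rejects it. (You may also need to encode > as &gt;, " as &quot;, and & as &amp;.)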
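Applied to the charFilter from the question, the escape would look roughly like the sketch below. The only change is &lt; in place of the raw < in the lookbehind. The rest of the pattern is copied unchanged, and this sketch does not check whether Java's regex engine accepts it or whether it matches sentence ends as intended.

```xml
<!-- Same charFilter as in the question; only the "<" in "(?<=" is XML-escaped. -->
<!-- The XML parser decodes &lt; back to "<" before Solr sees the attribute value. -->
<charFilter class="solr.PatternReplaceCharFilterFactory"
            pattern="(?&lt;=[^.!?\\s][^.!?]*(?:[.!?](?![']?\s|$)[^.!?]*)*)[.!?]+(?=\\s|$)"
            replacement=" monkeysentence"/>
```

XML does not treat the backslash specially, so sequences such as \\s and \s reach the regex engine exactly as typed. The single quote in ['] needs no escaping because the attribute is delimited by double quotes.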

Jonathan Callen answered Nov 09 '22 14:11
