I'm using Solr 3.5.0, and in my schema.xml I use the following to mark the end of sentences and replace the end punctuation with a symbolic token:
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="(?<=[^.!?\\s][^.!?]*(?:[.!?](?![']?\s|$)[^.!?]*)*)[.!?]+(?=\\s|$)"
replacement=" monkeysentence"/>
I'm not sure if that will even work for what I want, but first I need to solve the problem of escaping the '<' character in the first '?<=' lookbehind.
I get the following error:
org.xml.sax.SAXParseException: The value of attribute "pattern"
associated with an element type "null" must not contain the '<' character.
I've tried using a '\' as in:
pattern="(?\<=[^.!?\\s][^.!?]*(?:[.!?](?![']?\s|$)[^.!?]*)*)[.!?]+(?=\\s|$)"
But I get the same error.
As this is in an XML file, you will need to use an XML escape to encode '<', namely &lt; (you may also need to encode '>' as &gt;, '"' as &quot;, and '&' as &amp;).
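Applied to the charFilter from the question, the escaped attribute might look like the sketch below. Only the '<' in the lookbehind is changed to &lt;; the rest of the pattern is copied as posted, since it contains no '>', '"', or '&' that would need escaping inside a double-quoted attribute. This only addresses the SAXParseException. Whether the regex engine then accepts the pattern, and whether it matches sentence ends as intended, is a separate question that this sketch does not settle.

```xml
<!-- The XML parser decodes &lt; back to '<' before Solr sees the pattern,
     so the regex receives the lookbehind as (?<=...) -->
<charFilter class="solr.PatternReplaceCharFilterFactory"
            pattern="(?&lt;=[^.!?\\s][^.!?]*(?:[.!?](?![']?\s|$)[^.!?]*)*)[.!?]+(?=\\s|$)"
            replacement=" monkeysentence"/>
```

A backslash has no special meaning in XML, which is why '\<' produced the same error: the parser still saw a raw '<' inside the attribute value.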