We are currently using [^a-zA-Z0-9] in Java's replaceAll function to strip special characters from a string. We now need to allow hyphens when they are mixed with numbers.
Examples for which hyphens will not be matched:
Examples for which hyphens will be matched:
We think we have written a regex that meets the latter criterion, using this SO question as a reference, but we have no idea how to combine it with the original regex [^a-zA-Z0-9].
We want to do this to a Lucene search string because of the way Lucene's standard tokenizer works when indexing:
Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
You can't do this with a single regex. (Well... maybe in Perl.)
(edit: Okay, you can do it with variable-length negative lookbehind, which it appears Java can (almost uniquely!) do; see Cyborgx37's answer. Regardless, imo, you shouldn't do this with a single regex. :))
What you can do is split the string into words and deal with each word individually. My Java is pretty terrible so here is some hopefully-sensible Python:
import re

# Precompile some regex
looks_like_product_number = re.compile(r'\A[-0-9]+\Z')
not_wordlike = re.compile(r'[^a-zA-Z0-9]')
not_wordlike_or_hyphen = re.compile(r'[^-a-zA-Z0-9]')

# Split on anything that's not a letter, number, or hyphen -- BUT dots
# must be followed by whitespace
# (`string` holds the raw search text)
words = re.split(r'(?:[^-.a-zA-Z0-9]|[.]\s)+', string)

stripped_words = []
for word in words:
    if '-' in word and not looks_like_product_number.match(word):
        # Ordinary hyphenated word; strip the hyphens too
        stripped_word = not_wordlike.sub('', word)
    else:
        # Product number; allow dashes
        stripped_word = not_wordlike_or_hyphen.sub('', word)
    stripped_words.append(stripped_word)

pass_to_lucene(' '.join(stripped_words))
When I run this with 'wal-mart 1-2-3', I get back 'walmart 1-2-3'.
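For comparison, here is a more compact sketch of the same per-word idea using a substitution callback. It is not the code above: it assumes that "mixed with numbers" means the token contains at least one digit anywhere (which is how the quoted Lucene rule reads), so something like 'abc-123' keeps its hyphen here, whereas the loop above would strip it. The name strip_special is made up for illustration.

```python
import re

# A token is a run of letters, digits, and hyphens.
TOKEN = re.compile(r'[-a-zA-Z0-9]+')
# Everything except letters, digits, hyphens, and whitespace gets dropped.
NOT_ALLOWED = re.compile(r'[^-a-zA-Z0-9\s]')

def strip_special(text):
    """Hypothetical helper: keep hyphens only in tokens that contain a digit."""
    def fix_token(match):
        token = match.group(0)
        # Assumption: any digit in the token makes it a "product number".
        if any(ch.isdigit() for ch in token):
            return token
        return token.replace('-', '')
    # Drop the special characters first, then fix each token's hyphens.
    return TOKEN.sub(fix_token, NOT_ALLOWED.sub('', text))
```

Calling strip_special('wal-mart 1-2-3') gives 'walmart 1-2-3', the same result as the loop above for that input.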
But honestly, the above code reproduces most of what the Lucene tokenizer is already doing. I think you'd be better off just copying StandardTokenizer into your own project and modifying it to do what you want.
Have you tried this:
[^a-zA-Z0-9-]
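Note that this character class keeps every hyphen, not only the ones mixed with digits. A quick way to check that, sketched in Python for consistency with the answer above (the helper name is made up; the same pattern should behave the same way in Java's replaceAll, since a trailing hyphen inside a character class is a literal hyphen in both):

```python
import re

# Same pattern as suggested: a trailing '-' in a character class is literal.
keep_all_hyphens = re.compile(r'[^a-zA-Z0-9-]')

def strip_keep_all_hyphens(text):
    """Hypothetical helper: remove everything except letters, digits, hyphens."""
    return keep_all_hyphens.sub('', text)
```

With this, '1-2-3' comes back unchanged, but so does 'wal-mart', so it does not distinguish the two cases from the question.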