Regex to match special characters EXCEPT hyphen(s) mixed with number(s)

Tags: java, regex

We are currently using [^a-zA-Z0-9] in Java's replaceAll function to strip special characters from a string. It has come to our attention that we need to allow hyphen(s) when they are mixed with number(s).
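For reference, the current call boils down to something like this. The class and method names are made up for illustration; only the pattern and replaceAll are what we actually use.

```java
final class CurrentStripper {
    // Removes every character that is not an ASCII letter or digit,
    // which includes hyphens and whitespace.
    static String strip(String input) {
        return input.replaceAll("[^a-zA-Z0-9]", "");
    }

    public static void main(String[] args) {
        System.out.println(strip("wal-mart"));    // walmart
        System.out.println(strip("425-12-3456")); // 425123456 (hyphens lost, which is the problem)
    }
}
```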

Examples for which hyphens will not be matched:

  • 1-2-3
  • -1-23-4562
  • --1---2--3---4-
  • --9--a--7
  • 425-12-3456

Examples for which hyphens will be matched:

  • --a--b--c
  • wal-mart

We think we formulated a regex to meet the latter criteria using this SO question as a reference, but we have no idea how to combine it with the original regex [^a-zA-Z0-9].

We want to do this to a Lucene search string because of how Lucene's standard tokenizer works when indexing:

"Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split."

theblang asked Dec 14 '25 06:12


2 Answers

You can't do this with a single regex. (Well... maybe in Perl.)

(edit: Okay, you can do it with variable-length negative lookbehind, which it appears Java can (almost uniquely!) do; see Cyborgx37's answer. Regardless, imo, you shouldn't do this with a single regex. :))
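A rough sketch of what the single-regex route could look like in Java, for illustration only. This is not Cyborgx37's regex; the class name, the method name, and the 100-character lookbehind bound are assumptions made here. Java needs a lookbehind with a bounded length, so a token with more than 100 characters between a digit and a later hyphen would slip through.

```java
import java.util.regex.Pattern;

final class SingleRegexSketch {
    // A "token" here is a maximal run of letters, digits, and hyphens.
    // The {0,100} bound is an arbitrary assumption that keeps the lookbehind finite.
    private static final Pattern STRIP = Pattern.compile(
        "[^a-zA-Z0-9-]"                      // any character that is not a letter, digit, or hyphen
        + "|(?<![0-9][-a-zA-Z0-9]{0,100})"   // or: no digit earlier in the same token,
        + "-"                                //     then a hyphen,
        + "(?![-a-zA-Z0-9]*[0-9])");         //     and no digit later in the same token

    static String strip(String input) {
        return STRIP.matcher(input).replaceAll("");
    }
}
```

With this pattern, strip("wal-mart 1-2-3") gives walmart1-2-3. Whitespace is stripped too, just as with the original [^a-zA-Z0-9].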

What you can do is split the string into words and deal with each word individually. My Java is pretty terrible so here is some hopefully-sensible Python:

import re

# Precompile some regex
# A token containing any digit counts as a product number (the Lucene rule
# quoted in the question), so '--9--a--7' keeps its hyphens
looks_like_product_number = re.compile(r'[0-9]')
not_wordlike = re.compile(r'[^a-zA-Z0-9]')
not_wordlike_or_hyphen = re.compile(r'[^-a-zA-Z0-9]')

string = 'wal-mart 1-2-3'  # example input

# Split on anything that's not a letter, number, or hyphen -- BUT dots
# must be followed by whitespace
words = re.split(r'(?:[^-.a-zA-Z0-9]|[.]\s)+', string)

stripped_words = []
for word in words:
    if '-' in word and not looks_like_product_number.search(word):
        # Ordinary hyphenated word; strip the hyphens too
        stripped_word = not_wordlike.sub('', word)
    else:
        # Product number (or no hyphen at all); allow dashes
        stripped_word = not_wordlike_or_hyphen.sub('', word)

    stripped_words.append(stripped_word)

# Pass this string on to Lucene
print(' '.join(stripped_words))

When I run this with 'wal-mart 1-2-3', I get back 'walmart 1-2-3'.
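Since the question is about Java, here is a hedged port of the same word-by-word idea. The class and method names are invented for this sketch. One assumption to flag: a token counts as a product number when it contains at least one digit, so that --9--a--7 keeps its hyphens as the question asks. Unlike the Python, it also drops words that end up empty after stripping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class WordwiseStripper {
    // Assumption: a token with at least one digit is a product number.
    private static final Pattern HAS_DIGIT = Pattern.compile("[0-9]");
    private static final Pattern NOT_WORDLIKE = Pattern.compile("[^a-zA-Z0-9]");
    private static final Pattern NOT_WORDLIKE_OR_HYPHEN = Pattern.compile("[^-a-zA-Z0-9]");
    // Split on anything that is not a letter, digit, hyphen, or dot;
    // a dot only separates words when whitespace follows it.
    private static final Pattern SEPARATORS = Pattern.compile("(?:[^-.a-zA-Z0-9]|[.]\\s)+");

    static String strip(String input) {
        List<String> stripped = new ArrayList<>();
        for (String word : SEPARATORS.split(input)) {
            // Product numbers keep their hyphens; every other word loses them.
            boolean productNumber = HAS_DIGIT.matcher(word).find();
            Pattern unwanted = productNumber ? NOT_WORDLIKE_OR_HYPHEN : NOT_WORDLIKE;
            String cleaned = unwanted.matcher(word).replaceAll("");
            if (!cleaned.isEmpty()) {
                stripped.add(cleaned);
            }
        }
        return String.join(" ", stripped);
    }

    public static void main(String[] args) {
        System.out.println(strip("wal-mart 1-2-3")); // walmart 1-2-3
    }
}
```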

But honestly, the above code reproduces most of what the Lucene tokenizer is already doing. I think you'd be better off just copying StandardTokenizer into your own project and modifying it to do what you want.

Eevee answered Dec 16 '25 18:12


Have you tried this:

[^a-zA-Z0-9-]

woemler answered Dec 16 '25 18:12


