Regex to match special characters EXCEPT hyphen(s) mixed with number(s)

Question

We are currently using [^a-zA-Z0-9] in Java's replaceAll function to strip special characters from a string. It has come to our attention that we need to allow hyphen(s) when they are mixed with number(s).

Examples for which hyphens will not be matched:

1-2-3
-1-23-4562
--1---2--3---4-
--9--a--7
425-12-3456

Examples for which hyphens will be matched:

--a--b--c
wal-mart

We think we have formulated a regex for the latter case, using this SO question as a reference, but we have no idea how to combine it with the original regex [^a-zA-Z0-9].

We want to do this to a Lucene search string because of the way Lucene's standard tokenizer works when indexing:

Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.

Eevee · Accepted Answer

You can't do this with a single regex. (Well... maybe in Perl.)

(edit: Okay, you can do it with variable-length negative lookbehind, which it appears Java can (almost uniquely!) do; see Cyborgx37's answer. Regardless, imo, you shouldn't do this with a single regex. :))

What you can do is split the string into words and deal with each word individually. My Java is pretty terrible so here is some hopefully-sensible Python:

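import re

string = 'wal-mart 1-2-3'  # sample input

# Precompile the regexes.
# A word is treated as a product number if it contains at least one digit,
# matching the Lucene rule quoted in the question (so '--9--a--7' qualifies).
looks_like_product_number = re.compile(r'[0-9]')
not_wordlike = re.compile(r'[^a-zA-Z0-9]')
not_wordlike_or_hyphen = re.compile(r'[^-a-zA-Z0-9]')

# Split on anything that's not a letter, number, hyphen, or dot; a dot
# only splits when it is followed by whitespace
words = re.split(r'(?:[^-.a-zA-Z0-9]|[.]\s)+', string)

stripped_words = []
for word in words:
    if '-' in word and not looks_like_product_number.search(word):
        # Ordinary hyphenated word: strip everything, hyphens included
        stripped_word = not_wordlike.sub('', word)
    else:
        # Product number (or no hyphen at all); allow dashes
        stripped_word = not_wordlike_or_hyphen.sub('', word)

    stripped_words.append(stripped_word)

# This is the string to pass on to Lucene
query = ' '.join(stripped_words)
print(query)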

When I run this with 'wal-mart 1-2-3', I get back 'walmart 1-2-3'.
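
For reference, here is a more compact variant of the same per-word idea, offered only as a sketch. It runs re.sub with a callback over each whitespace-delimited token and treats any token containing a digit as a product number, which is an assumption taken from the Lucene rule quoted in the question. The names strip_token and strip_query are made up for illustration.

```python
import re

def strip_token(match):
    """Clean one whitespace-delimited token (hypothetical helper)."""
    token = match.group(0)
    if re.search(r'[0-9]', token):
        # Assumed product number: keep hyphens, drop other special characters
        return re.sub(r'[^a-zA-Z0-9-]', '', token)
    # Ordinary word: drop every special character, hyphens included
    return re.sub(r'[^a-zA-Z0-9]', '', token)

def strip_query(text):
    """Apply strip_token to every run of non-whitespace characters."""
    return re.sub(r'\S+', strip_token, text)

print(strip_query('wal-mart 1-2-3 --9--a--7 --a--b--c'))
# prints: walmart 1-2-3 --9--a--7 abc
```

Unlike the loop above, this only treats whitespace as a word boundary, so it leaves the original spacing untouched but will not split a token such as 'foo&1-2' into two words.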

But honestly, the above code reproduces most of what the Lucene tokenizer is already doing. I think you'd be better off just copying StandardTokenizer into your own project and modifying it to do what you want.

woemler · Answer

Have you tried this:

[^a-zA-Z0-9-]
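
A quick way to see what this character class does is to try it on the question's examples. The sketch below uses Python's re.sub as a stand-in for Java's replaceAll (an assumption; the two should agree on a pattern this simple), and strip_keep_hyphens is a made-up name.

```python
import re

# Same character class as above: everything except letters, digits, hyphens
keep_all_hyphens = re.compile(r'[^a-zA-Z0-9-]')

def strip_keep_hyphens(text):
    """Remove special characters but leave every hyphen in place."""
    return keep_all_hyphens.sub('', text)

for example in ['425-12-3456', 'wal-mart', '--a--b--c']:
    print(example, '->', strip_keep_hyphens(example))
```

On its own the pattern keeps hyphens everywhere, including in 'wal-mart' and '--a--b--c', so a separate check for digits would still be needed to cover the second set of examples in the question.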

Tags: java, regex

theblang
