
Regex to match words with punctuation but not punctuation alone

Tags: regex, swift

I need to match words in a string that may contain symbols, both inside words and as standalone punctuation. An example string could be:

This string's is a good example of situation I'll fail to match - due to punctuations being all over the place.

Ignoring the weird English of that sentence, I need to match every word, but not punctuation unless it is part of a word. So my result should be:

  1. This
  2. string's (match 's since it is part of the word)
  3. is
  4. a
  5. good

...

  1. I'll (match the 'll to I since it is part of the word)
  2. fail
  3. to
  4. match
  5. due (skip the -)
  6. to

...

  1. the
  2. place (no full stop since it is not part of the word.)

I managed to come up with two regexes that partially work, but neither works the way I want:

(?<=\\s|^)[A-Za-z0-9]+?(?=\\s|$) - I am using Swift, so `\\s` in the string literal stands for `\s` (whitespace)

This matches normal words but not cases like string's since there is a ' in the word. But if I use my other expression:

(?<=\\s|^).+?(?=\\s|$)

It matches string's, but it also matches the standalone - and keeps the full stop at the end of a sentence, as in place.
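To make the difference between the two attempts concrete, here is a minimal sketch that runs both patterns over the example sentence. It assumes Foundation's NSRegularExpression is the engine in use; the helper name allMatches is hypothetical and not part of the question.

```swift
import Foundation

// Hypothetical helper (not from the question): returns every substring of
// `text` matched by `pattern`, or an empty array if the pattern is invalid.
func allMatches(of pattern: String, in text: String) -> [String] {
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return [] }
    let fullRange = NSRange(text.startIndex..<text.endIndex, in: text)
    return regex.matches(in: text, range: fullRange).compactMap { result in
        Range(result.range, in: text).map { String(text[$0]) }
    }
}

let sentence = "This string's is a good example of situation I'll fail to match - due to punctuations being all over the place."

// First attempt: only purely alphanumeric tokens delimited by whitespace.
let alphanumericOnly = allMatches(of: "(?<=\\s|^)[A-Za-z0-9]+?(?=\\s|$)", in: sentence)

// Second attempt: any run of characters delimited by whitespace.
let anyToken = allMatches(of: "(?<=\\s|^).+?(?=\\s|$)", in: sentence)

print(alphanumericOnly.contains("string's")) // false
print(anyToken.contains("string's"))         // true
print(anyToken.contains("-"))                // true
print(anyToken.last ?? "")                   // place.
```

On this input the first pattern also drops place, because the trailing full stop makes the lookahead fail, while the second pattern behaves like splitting on spaces.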

Is there an expression that matches words with punctuation but not punctuation alone? I do not mind if it takes multiple expressions to capture all the results; I can merge the results before displaying them.

Note: Apart from the example given, the punctuation marks that I know may or may not be part of a word are - ' () . whereas the ones that will only ever appear as part of a word are % $ # & /. Any other punctuation can be assumed never to be part of a word. ! ? " : may appear with or without spacing from words, but must not be included in the result.

Fortunately, the string can safely be assumed to contain only alphanumerics and punctuation symbols. Characters from other languages and symbols like <>{}[]| or +*= can be assumed not to appear in the string. There are some other symbols that fit into one of the four groups of symbols and that I cannot predict now, but I believe that if I can get logic that works, I can adapt it to include more symbols in each group.

Ben Ong asked Feb 03 '17 07:02


2 Answers

It seems that you need a regex that matches the selected symbols only when they are preceded or followed by "word" characters, along with plain letters/digits and parenthesized runs of letters/digits. Each match must be delimited by whitespace, the start/end of the string, or a word boundary (note that this order is important).

Use

(?<=\\s|^|\\b)(?:[-'.%$#&/]\\b|\\b[-'.%$#&/]|[A-Za-z0-9]|\\([A-Za-z0-9]+\\))+(?=\\s|$|\\b)

See the regex demo.

Details:

  • (?<=\\s|^|\\b) - a positive lookbehind requiring that there must be whitespace, or start of string or a word boundary to the left of the current location
  • (?: - start of the non-capturing group, matching 1+ sequences of:
    • [-'.%$#&/]\\b - one of the specified symbols followed by a word character
    • | - or
    • \\b[-'.%$#&/] - one of the specified symbols preceded by a word character
    • | - or
    • [A-Za-z0-9] - an alphanumeric character
    • | - or
    • \\([A-Za-z0-9]+\\) - a (, followed by 1+ alphanumeric characters, and a )
  • )+ - end of the non-capturing group
  • (?=\\s|$|\\b) - a positive lookahead requiring that there must be a whitespace, end of string or a word boundary immediately to the right of the current location.
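The breakdown above can be tried out with a short sketch like the following, assuming Foundation's NSRegularExpression; the helper name wordMatches is hypothetical, and the sample is the sentence from the question.

```swift
import Foundation

// Hypothetical helper (not part of the answer): collects every substring of
// `text` that `pattern` matches, or an empty array if the pattern is invalid.
func wordMatches(in text: String, pattern: String) -> [String] {
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return [] }
    let fullRange = NSRange(text.startIndex..<text.endIndex, in: text)
    return regex.matches(in: text, range: fullRange).compactMap { result in
        Range(result.range, in: text).map { String(text[$0]) }
    }
}

// The pattern from the answer, written as a Swift string literal.
let wordPattern = "(?<=\\s|^|\\b)(?:[-'.%$#&/]\\b|\\b[-'.%$#&/]|[A-Za-z0-9]|\\([A-Za-z0-9]+\\))+(?=\\s|$|\\b)"

let sample = "This string's is a good example of situation I'll fail to match - due to punctuations being all over the place."
let words = wordMatches(in: sample, pattern: wordPattern)

print(words.contains("string's")) // true
print(words.contains("I'll"))     // true
print(words.contains("-"))        // false
print(words.last ?? "")           // place.
```

Note that the final token still carries its full stop here: the \\b[-'.%$#&/] branch accepts a dot that is preceded by a word character, so a sentence-ending dot is treated like one inside a word.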

To only match dots as decimal separators, add an additional \\d*\\.?\\d+ branch and remove . from the character classes:

(?<=\\s|^|\\b)(?:[-'%$#&/]\\b|\\b[-'%$#&/]|\\d*\\.?\\d+|[A-Za-z0-9]|\\([A-Za-z0-9]+\\))+(?=\\s|$|\\b)

See this regex demo
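As a rough check of the decimal-separator variant, the sketch below runs it on the question's sentence and on a made-up string containing a decimal number. It assumes Foundation's NSRegularExpression; the helper name tokens and the second input string are illustrative assumptions, not part of the answer.

```swift
import Foundation

// Hypothetical helper (not part of the answer): returns all substrings of
// `text` matched by `pattern`, or an empty array if the pattern is invalid.
func tokens(in text: String, pattern: String) -> [String] {
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return [] }
    let fullRange = NSRange(text.startIndex..<text.endIndex, in: text)
    return regex.matches(in: text, range: fullRange).compactMap { result in
        Range(result.range, in: text).map { String(text[$0]) }
    }
}

// Variant with . removed from the symbol classes and a decimal-number branch added.
let decimalPattern = "(?<=\\s|^|\\b)(?:[-'%$#&/]\\b|\\b[-'%$#&/]|\\d*\\.?\\d+|[A-Za-z0-9]|\\([A-Za-z0-9]+\\))+(?=\\s|$|\\b)"

let questionSentence = "This string's is a good example of situation I'll fail to match - due to punctuations being all over the place."
let sentenceTokens = tokens(in: questionSentence, pattern: decimalPattern)
print(sentenceTokens.last ?? "") // place

// Made-up input: the dot inside the number is kept, the final full stop is not.
print(tokens(in: "Version 2.5 is out.", pattern: decimalPattern))
// ["Version", "2.5", "is", "out"]
```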

Wiktor Stribiżew answered Oct 05 '22 23:10


Assuming there is at most one punctuation symbol in a word, you can try:

(?<=\\s|^)([A-Za-z0-9]+?|[A-Za-z0-9]*?[-'().%$#&/][A-Za-z0-9]*?)(?=\\s|$)

But Wiktor Stribiżew's solution is better:

(?<=\\s|^|\\b)(?:[-'.%$#&/]\\b|\\b[-'.%$#&/]|[A-Za-z0-9]|\\([A-Za-z0-9]+\\))+(?=\\s|$|\\b)

baddger964 answered Oct 06 '22 01:10