
Using regex to check comma's usage

Tags: regex, php

How can I write a regular expression that spots incorrect usage of a comma in a string? The rules are: 1. for non-numbers, no space before the comma and exactly one space after it; 2. for numbers, a comma is allowed if it is preceded by 1-3 digits and followed by exactly 3 digits.

Some test cases:

  • hello, world
  • hello,world => incorrect
  • hello ,world => incorrect
  • 1,234 worlds
  • 1,23 worlds => incorrect
  • 1,2345 worlds => incorrect
  • hello,123 worlds => incorrect
  • hello, 1234,567 worlds => incorrect
  • hello, 12,34,567 worlds => incorrect
  • (new test case) hello 1, 2, and 3 worlds
  • (new test case) hello $1,234 worlds
  • (new test case) hello $1,2345 worlds => incorrect
  • (new test case) hello "1,234" worlds
  • (new test case) hello "1,23" worlds => incorrect

So I thought I'd have one regex to capture words with bad syntax via (?![\S\D],[\S\D]) (capture where a non-space/non-digit is followed by a comma and then another non-space/non-digit), and join that with another regex to capture numbers with bad syntax, via (?!(.*?^(?:\d+|\d{1,3}(?:,\d{3})*)(?:\.\d+)?$)). Putting that together gets me:

preg_match_all("/(?![\S\D],[\S\D])|(?!(.*?^(?:\d+|\d{1,3}(?:,\d{3})*)(?:\.\d+)?$))/",$str,$syntax_result);

...but obviously it doesn't work. How should it be done?

================EDIT================

Thanks to Casimir et Hippolyte's answer below, I got it to work! I've extended his pattern to take care of more corner cases. I don't know if the syntax I added is the most efficient, but it works for now. I'll update this as more corner cases come up!

$pattern = <<<'LOD'
~
(?: # this group contains allowed commas
    [\w\)]+,((?=[ ][\w\s\(\"]+)|(?=[\s]+))  # comma between words or line break
  |
    (?<=^|[^\PP,]|[£$\s]) [0-9]{1,3}(?:,[0-9]{3})* (?=[€\s]|[^\PP,]|$) # thousands separator
) (*SKIP) (*FAIL) # make the pattern fail and forbid backtracking
| , # other commas
~mx
LOD;
Alex asked Dec 08 '13 00:12



1 Answer

It isn't watertight, but this can give you an idea of how to proceed:

$pattern = <<<'LOD'
~
(?: # this group contains allowed commas
    \w+,(?=[ ]\w+)  # comma between words
  |
    (?<=^|[^\PP,]|[£$\s]) [0-9]{1,3}(?:,[0-9]{3})* (?=[€\s]|[^\PP,]|$) # thousands separator
) (*SKIP) (*FAIL) # make the pattern fail and forbid backtracking
| , # other commas
~mx
LOD;

preg_match_all($pattern, $text, $matches, PREG_OFFSET_CAPTURE);

print_r($matches[0]);

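The idea is to exclude allowed commas from the match result so that only incorrect commas are returned. The first non-capturing group acts as a whitelist of correct situations: whenever one of them matches, (*SKIP) (*FAIL) discards that text and the search resumes after it. Any comma that is left over is matched by the last alternative. (You can easily add other cases to the group.)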
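To see the exclusion technique in isolation, here is a minimal sketch with a single allowed case (digit groups such as 1,234). It is deliberately simpler than the pattern above: it has no lookarounds, so it does not reject things like 1,2345. The function name findBadCommas is made up for this illustration.

```php
<?php
// Minimal sketch of the (*SKIP)(*FAIL) idiom, not the full pattern from the answer.
// findBadCommas is a hypothetical helper name used only for this example.
function findBadCommas(string $text): array
{
    // Left branch: a well-formed digit group is consumed, then (*SKIP)(*FAIL)
    // throws it away and restarts the search after it.
    // Right branch: any comma that was not consumed by the left branch.
    $pattern = '~\d{1,3}(?:,\d{3})+(*SKIP)(*FAIL)|,~';

    preg_match_all($pattern, $text, $matches, PREG_OFFSET_CAPTURE);

    // Each entry of $matches[0] is [matched text, byte offset]; keep the offsets.
    return array_column($matches[0], 1);
}

print_r(findBadCommas('1,234 and a,b'));
```

Only the comma in "a,b" is reported, because the comma in "1,234" was already consumed by the skipped branch.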

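[^\PP,] means "all punctuation characters except ,": \PP matches any character that is not punctuation, so the negated class matches punctuation only, minus the comma. You can replace this character class with a more explicit list of allowed characters, for example: [("']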
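The double negation is easy to misread, so here is a small sketch that tests single characters against that class. The helper name isNonCommaPunctuation is an assumption made for this example, and the u modifier is added here so the class is evaluated on Unicode characters.

```php
<?php
// Sketch: check which single characters the class [^\PP,] accepts.
// isNonCommaPunctuation is a hypothetical helper name used only for this example.
function isNonCommaPunctuation(string $ch): bool
{
    // \PP = "not punctuation"; the negated class therefore keeps punctuation only,
    // and the comma is removed explicitly.
    return preg_match('/^[^\PP,]$/u', $ch) === 1;
}

foreach (['"', '(', '.', ',', 'a', '$', ' '] as $ch) {
    printf("%s => %s\n", var_export($ch, true), isNonCommaPunctuation($ch) ? 'matches' : 'no match');
}
```

Note that $ is a currency symbol rather than punctuation in Unicode terms, which is why the pattern lists it separately in [£$\s].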

You can find more information about (*SKIP) and (*FAIL) in the PCRE documentation on backtracking control verbs.
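A quick way to check the pattern is to run it over the test cases from the question. The script below is a sketch: the pattern is copied from this answer with shortened comments, while the helper hasBadComma and the $cases array (test string => whether the question marks it as incorrect) are names introduced here for illustration.

```php
<?php
// Pattern from the answer; x mode ignores whitespace and # comments.
$pattern = <<<'LOD'
~
(?: # allowed commas
    \w+,(?=[ ]\w+)
  |
    (?<=^|[^\PP,]|[£$\s]) [0-9]{1,3}(?:,[0-9]{3})* (?=[€\s]|[^\PP,]|$)
) (*SKIP) (*FAIL)
| , # other commas
~mx
LOD;

// Hypothetical helper: true when at least one disallowed comma is found.
function hasBadComma(string $text, string $pattern): bool
{
    return preg_match($pattern, $text) === 1;
}

// Test cases from the question; true means the question marks it "incorrect".
$cases = [
    'hello, world'                   => false,
    'hello,world'                    => true,
    'hello ,world'                   => true,
    '1,234 worlds'                   => false,
    '1,23 worlds'                    => true,
    '1,2345 worlds'                  => true,
    'hello,123 worlds'               => true,
    'hello, 1234,567 worlds'         => true,
    'hello, 12,34,567 worlds'        => true,
    'hello 1, 2, and 3 worlds'       => false,
    'hello $1,234 worlds'            => false,
    'hello $1,2345 worlds'           => true,
    'hello "1,234" worlds'           => false,
    'hello "1,23" worlds'            => true,
];

foreach ($cases as $text => $expectedBad) {
    $status = hasBadComma($text, $pattern) === $expectedBad ? 'ok' : 'MISMATCH';
    printf("%-28s %s\n", $text, $status);
}
```

Any new corner case can be added to $cases first, then handled by adding another alternative to the group of allowed commas.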

Casimir et Hippolyte answered Oct 02 '22 00:10