Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

PHP regex performance

I have to take out some data from strings. Unfortunately data has realy unfriendly format. I had to create about 15 regural expressions placed in separate preg_replace. It's worth to say that they have many OR (|) within itself. My question is what should I do finally: combine all expressions into one and separate them using | or leave them as is - in individual preg_replace?

Is it very bad practice to create other expressions to keep clarity? I think maybe I could combine some expressions into the one but they become very complicated and not understanding.

For example I have:

$itemFullName = preg_replace("@^\b(([a-zA-Z]{1,3})?[0-9]{1,2}(\.|\-|X)[0-9]{1,2}(\s|\.|\-)?(X|x)?\s?[0-9]{1,3}\.?(([0-9]{1,3})?(X[0-9]{1,3})|(\s[0-9]\/[0-9]|\/[0-9]{1,3}))?(\s\#[0-9]{1,3}\/[0-9]{1,3})?)\s@", ' ', $itemFullName, -1, $sum);
like image 449
webrama.pl Avatar asked Nov 28 '22 14:11

webrama.pl


2 Answers

Untidy:

For starters your original PHP statement:

$itemFullName = preg_replace("@^\b(([a-zA-Z]{1,3})?[0-9]{1,2}(\.|\-|X)[0-9]{1,2}(\s|\.|\-)?(X|x)?\s?[0-9]{1,3}\.?(([0-9]{1,3})?(X[0-9]{1,3})|(\s[0-9]\/[0-9]|\/[0-9]{1,3}))?(\s\#[0-9]{1,3}\/[0-9]{1,3})?)\s@", ' ', $itemFullName, -1, $sum);

would be much more readable (and maintainable) if you write it in free-spacing mode with comments like so:

Tidy:

$itemFullName = preg_replace("/(?#!php re_item_tidy Rev:20180207_0700)
    ^                     # Anchor to start of string.
    \b                    # String must begin with a word char.
    (                     # $1: Unnecessary group.
      ([a-zA-Z]{1,3})?    # $2: Optional 1-3 alphas.
      [0-9]{1,2}          # 1-2 decimal digits.
      (\.|\-|X)           # $3: Either a dot, hyphen or X.
      [0-9]{1,2}          # One or two decimal digits.
      (\s|\.|\-)?         # $4: Optional whitespace, dot or hyphen.
      (X|x)?              # $5: Optional X or x.
      \s?[0-9]{1,3}\.?    # Optional whitespace, 1-3 digits, optional dot.
      (                   # $6: Optional ??? from 2 alternatives.
        ([0-9]{1,3})?     # Either a1of2 $7: Optional 1-3 digits.
        (X[0-9]{1,3})     # $8: X and 1-3 digits.
      | (                 # Or a2of2 $9: one ??? from 2 alternatives.
          \s[0-9]\/[0-9]  # Either a1of2.
        | \/[0-9]{1,3}    # Or a2of2.
        )                 # End $9: one ??? from 2 alternatives.
      )?                  # End $6: optional ??? from 2 alternatives.
      (                   # $10: Optional sequence.
        \s\#[0-9]{1,3}    # whitespace, hash, 1-3 digits.
        \/[0-9]{1,3}      # Forward slash, 1-3 digits.
      )?                  # End $10: Optional sequence
    )                     # End $1: Unnecessary group.
    \s                    # End with a single whitespace char.
    /x", ' ', $itemFullName, -1, $sum);

Critique:

This regex is really not bad performance-wise. It has a start of string anchor at the start which helps it fail quickly for non-matching strings. It also does not have any backtracking problems. However, there are a few minor improvements which can be made:

  • There are three groups of alternatives where each of the alternatives consists of only one character - each of these can be replaced with a simple character class.
  • There are 10 capture groups but preg_replace uses none of the captured data. These capture groups can be changed to be non-capturing.
  • There are several unnecessary groups which can be simply removed.
  • Group 2: ([a-zA-Z]{1,3})? can be written more simply as: [a-zA-Z]{0,3}. Group 7 has a similar construct.
  • The \b word boundary at the start is unnecessary.
  • With PHP, its best to enclose regex patterns inside single quoted strings. Double quoted strings have many metacharacters that must be escaped. Single quoted strings only have two: the single quote and the backslash.
  • There are a few unnecessarily escaped forward slashes.

Note also that you are using the $sum variable to count the number of replacements being made by preg_replace(). Since you have a ^ start of string anchor at the beginning of the pattern, you will only ever have one replacement because you have not specifid the 'm' multi-line modifier. I am assuming that you actually do want to perform multiple replacements (and count them in $sum), so I've added the 'm' modifier.

Here is an improved version incorporating these changes:

Tidier:

$itemFullName = preg_replace('%(?#!php/m re_item_tidier Rev:20180207_0700)
    ^                  # Anchor to start of string.
    [a-zA-Z]{0,3}      # Optional 1-3 alphas.
    [0-9]{1,2}         # 1-2 decimal digits.
    [.X-]              # Either a dot, hyphen or X.
    [0-9]{1,2}         # One or two decimal digits.
    [\s.-]?            # Optional whitespace, dot or hyphen.
    [Xx]?              # Optional X or x.
    \s?[0-9]{1,3}\.?   # Optional whitespace, 1-3 digits, optional dot.
    (?:                # Optional ??? from 2 alternatives.
      [0-9]{0,3}       # Either a1of2: Optional 1-3 digits
      X[0-9]{1,3}      # followed by X and 1-3 digits.
    | (?:              # Or a2of2: One ??? from 2 alternatives.
        \s[0-9]/[0-9]  # Either a1of2.
      | /[0-9]{1,3}    # Or a2of2.
      )                # End one ??? from 2 alternatives.
    )?                 # End optional ??? from 2 alternatives.
    (?:                # Optional sequence.
      \s\#[0-9]{1,3}   # whitespace, hash, 1-3 digits.
      /[0-9]{1,3}      # Forward slash, 1-3 digits.
    )?                 # End optional sequence
    \s                 # End with a single whitespace char.
    %xm', ' ', $itemFullName, -1, $sum);

Note however, that I don't think you'll see much if any performance improvements - your original regex is pretty good. Your performance issues are probably coming from some other aspect of your program.

Hope this helps.

Edit 2018-02-07: Removed extraneous double quote, added regex shebangs.

like image 105
ridgerunner Avatar answered Dec 05 '22 17:12

ridgerunner


My question is what should I do finally: combine all expressions into one and separate them using | or leave them as is - in individual preg_replace?

Keep the regular expressions in separate preg_replace() calls because this gives you more maintainability, readability and efficiency.

Using a lot of OR operators | in your regular expression is not performance friendly especially for large amounts of text because the regular expression engine has to apply at every character in the input, it has to apply every alternative in the OR operator's | list.

like image 41
Ibrahim Najjar Avatar answered Dec 05 '22 17:12

Ibrahim Najjar