I have to take out some data from strings. Unfortunately data has realy unfriendly format. I had to create about 15 regural expressions placed in separate preg_replace. It's worth to say that they have many OR (|) within itself. My question is what should I do finally: combine all expressions into one and separate them using | or leave them as is - in individual preg_replace?
Is it very bad practice to create other expressions to keep clarity? I think maybe I could combine some expressions into the one but they become very complicated and not understanding.
For example I have:
$itemFullName = preg_replace("@^\b(([a-zA-Z]{1,3})?[0-9]{1,2}(\.|\-|X)[0-9]{1,2}(\s|\.|\-)?(X|x)?\s?[0-9]{1,3}\.?(([0-9]{1,3})?(X[0-9]{1,3})|(\s[0-9]\/[0-9]|\/[0-9]{1,3}))?(\s\#[0-9]{1,3}\/[0-9]{1,3})?)\s@", ' ', $itemFullName, -1, $sum);
For starters your original PHP statement:
$itemFullName = preg_replace("@^\b(([a-zA-Z]{1,3})?[0-9]{1,2}(\.|\-|X)[0-9]{1,2}(\s|\.|\-)?(X|x)?\s?[0-9]{1,3}\.?(([0-9]{1,3})?(X[0-9]{1,3})|(\s[0-9]\/[0-9]|\/[0-9]{1,3}))?(\s\#[0-9]{1,3}\/[0-9]{1,3})?)\s@", ' ', $itemFullName, -1, $sum);
would be much more readable (and maintainable) if you write it in free-spacing mode with comments like so:
$itemFullName = preg_replace("/(?#!php re_item_tidy Rev:20180207_0700)
^ # Anchor to start of string.
\b # String must begin with a word char.
( # $1: Unnecessary group.
([a-zA-Z]{1,3})? # $2: Optional 1-3 alphas.
[0-9]{1,2} # 1-2 decimal digits.
(\.|\-|X) # $3: Either a dot, hyphen or X.
[0-9]{1,2} # One or two decimal digits.
(\s|\.|\-)? # $4: Optional whitespace, dot or hyphen.
(X|x)? # $5: Optional X or x.
\s?[0-9]{1,3}\.? # Optional whitespace, 1-3 digits, optional dot.
( # $6: Optional ??? from 2 alternatives.
([0-9]{1,3})? # Either a1of2 $7: Optional 1-3 digits.
(X[0-9]{1,3}) # $8: X and 1-3 digits.
| ( # Or a2of2 $9: one ??? from 2 alternatives.
\s[0-9]\/[0-9] # Either a1of2.
| \/[0-9]{1,3} # Or a2of2.
) # End $9: one ??? from 2 alternatives.
)? # End $6: optional ??? from 2 alternatives.
( # $10: Optional sequence.
\s\#[0-9]{1,3} # whitespace, hash, 1-3 digits.
\/[0-9]{1,3} # Forward slash, 1-3 digits.
)? # End $10: Optional sequence
) # End $1: Unnecessary group.
\s # End with a single whitespace char.
/x", ' ', $itemFullName, -1, $sum);
This regex is really not bad performance-wise. It has a start of string anchor at the start which helps it fail quickly for non-matching strings. It also does not have any backtracking problems. However, there are a few minor improvements which can be made:
([a-zA-Z]{1,3})?
can be written more simply as: [a-zA-Z]{0,3}
. Group 7 has a similar construct.\b
word boundary at the start is unnecessary.Note also that you are using the $sum
variable to count the number of replacements being made by preg_replace()
. Since you have a ^
start of string anchor at the beginning of the pattern, you will only ever have one replacement because you have not specifid the 'm'
multi-line modifier. I am assuming that you actually do want to perform multiple replacements (and count them in $sum
), so I've added the 'm'
modifier.
Here is an improved version incorporating these changes:
$itemFullName = preg_replace('%(?#!php/m re_item_tidier Rev:20180207_0700)
^ # Anchor to start of string.
[a-zA-Z]{0,3} # Optional 1-3 alphas.
[0-9]{1,2} # 1-2 decimal digits.
[.X-] # Either a dot, hyphen or X.
[0-9]{1,2} # One or two decimal digits.
[\s.-]? # Optional whitespace, dot or hyphen.
[Xx]? # Optional X or x.
\s?[0-9]{1,3}\.? # Optional whitespace, 1-3 digits, optional dot.
(?: # Optional ??? from 2 alternatives.
[0-9]{0,3} # Either a1of2: Optional 1-3 digits
X[0-9]{1,3} # followed by X and 1-3 digits.
| (?: # Or a2of2: One ??? from 2 alternatives.
\s[0-9]/[0-9] # Either a1of2.
| /[0-9]{1,3} # Or a2of2.
) # End one ??? from 2 alternatives.
)? # End optional ??? from 2 alternatives.
(?: # Optional sequence.
\s\#[0-9]{1,3} # whitespace, hash, 1-3 digits.
/[0-9]{1,3} # Forward slash, 1-3 digits.
)? # End optional sequence
\s # End with a single whitespace char.
%xm', ' ', $itemFullName, -1, $sum);
Note however, that I don't think you'll see much if any performance improvements - your original regex is pretty good. Your performance issues are probably coming from some other aspect of your program.
Hope this helps.
Edit 2018-02-07: Removed extraneous double quote, added regex shebangs.
My question is what should I do finally: combine all expressions into one and separate them using | or leave them as is - in individual preg_replace?
Keep the regular expressions in separate preg_replace()
calls because this gives you more maintainability, readability and efficiency.
Using a lot of OR operators |
in your regular expression is not performance friendly especially for large amounts of text because the regular expression engine has to apply at every character in the input, it has to apply every alternative in the OR operator's |
list.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With