I'm trying to remove everything except valid letters (from any language) in PHP. I've been using this:
$content=preg_replace('/[^\pL\p{Zs}]/u', '', $content);
But it's painfully slow. It takes about 30 times longer than:
$content=preg_replace('/[^a-z\s]/', '', $content);
I'm dealing with large amounts of data, so it really isn't feasible to use a slow method.
Is there a faster way of doing this?
Well, it's a wonder it's only 30 times slower, seeing that it has to take thousands of times more characters than just a-z into account (Unicode defines well over 100,000 letters) when checking whether a given code point is a letter.
That said, you can improve your regex a bit:
$content=preg_replace('/[^\pL\p{Zs}]+/u', '', $content);
should speed it up by combining adjacent non-letters/space separators into one single replace operation.
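As a quick sanity check, here is a minimal sketch showing that the `+` version gives the same result as the original pattern, so only the number of replace operations changes. The sample string is made up for illustration, and any actual speedup will depend on your data and PCRE build.

```php
<?php
// Hypothetical sample input; any valid UTF-8 string would do.
$content = "Héllo, wörld! 123 ... Привет, мир?";

// Original pattern: one replacement per offending character.
$single = preg_replace('/[^\pL\p{Zs}]/u', '', $content);

// Suggested pattern: one replacement per run of offending characters.
$runs = preg_replace('/[^\pL\p{Zs}]+/u', '', $content);

// Both patterns strip exactly the same characters.
var_dump($single === $runs);
```

If you want numbers for your own data, wrapping each call in `microtime(true)` measurements over a large input is the simplest way to compare them.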
You could try the new PCRE 8.20 release, built with the --enable-jit
option. That JIT-compiles the regex and might improve performance for you.
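To see what your own installation offers, a rough check like the one below may help. It assumes you run it on the PHP build in question: `PCRE_VERSION` is a built-in constant, and the `pcre.jit` INI setting only exists from PHP 7.0 on. The `$status` variable is just a name picked for this sketch, and an enabled setting does not by itself prove the underlying PCRE library was compiled with JIT support.

```php
<?php
// Version of the PCRE library PHP was built against.
echo 'PCRE version: ', PCRE_VERSION, PHP_EOL;

// ini_get() returns false when the setting does not exist (PHP < 7.0).
$jit = ini_get('pcre.jit');

if ($jit === false) {
    $status = 'pcre.jit setting not available in this PHP version';
} elseif (filter_var($jit, FILTER_VALIDATE_BOOLEAN)) {
    $status = 'pcre.jit is enabled';
} else {
    $status = 'pcre.jit is disabled';
}

echo $status, PHP_EOL;
```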