
RegEx: Remove non-letters UTF-8 Safe, Quickly

Tags: regex, php, utf-8

I'm trying to remove everything except valid letters (from any language) in PHP. I've been using this:

$content=preg_replace('/[^\pL\p{Zs}]/u', '', $content);

But it's painfully slow. It takes about 30x longer than:

$content=preg_replace('/[^a-z\s]/', '', $content);

I'm dealing with large amounts of data, so it really isn't feasible to use a slow method.

Is there a faster way of doing this?

Alasdair asked Nov 12 '11 09:11


2 Answers

Well, it's a wonder it's only 30 times slower, given that the engine has to consider roughly 1000 times as many characters as just a-z when checking whether a given code point is a letter.

That said, you can improve your regex a bit:

$content=preg_replace('/[^\pL\p{Zs}]+/u', '', $content);

This should speed it up, because each run of adjacent characters that are neither letters nor space separators is matched and removed in a single replace operation instead of one character at a time.
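To check that adding the `+` quantifier changes only the number of matches and not the result, something like the following sketch could be used. The sample string and the repeat count are made up for illustration, and the timings will vary by machine, PHP version, and PCRE build, so treat the numbers as a rough comparison only.

```php
<?php
// Hypothetical sample input: Latin, Cyrillic and CJK letters mixed with
// digits and punctuation. The source file must be saved as UTF-8.
$content = "Hello, мир! 42 -- 你好...";

// Original pattern: one match (and one replacement) per unwanted character.
$perChar = preg_replace('/[^\pL\p{Zs}]/u', '', $content);

// Quantified pattern: one match per run of unwanted characters.
$perRun = preg_replace('/[^\pL\p{Zs}]+/u', '', $content);

// Both variants should strip exactly the same characters.
var_dump($perChar === $perRun);

// Rough timing on a larger input (repeat count chosen arbitrarily).
$big = str_repeat($content, 2000);
foreach (['/[^\pL\p{Zs}]/u', '/[^\pL\p{Zs}]+/u'] as $pattern) {
    $start = microtime(true);
    preg_replace($pattern, '', $big);
    printf("%-20s %.4f s\n", $pattern, microtime(true) - $start);
}
```

How much the quantifier helps depends on the data: the more the unwanted characters cluster together, the fewer separate replacements are needed.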

Tim Pietzcker answered Sep 27 '22 20:09


You could try the new PCRE 8.20 release, built with the --enable-jit configure option. That makes the library JIT-compile the regex, which might improve performance for you.
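Before rebuilding anything, it may help to see which PCRE version PHP is linked against. The sketch below is an assumption-laden check, not a definitive test: `PCRE_VERSION` is the constant exposed by PHP's PCRE extension, and the `pcre.jit` ini setting only exists in PHP 7.0 and later, so on older versions `ini_get()` returns `false`. A version of 8.20 or later does not by itself prove the library was compiled with JIT support.

```php
<?php
// PCRE_VERSION looks like "8.20 2011-10-21"; keep only the version number.
$pcreVersion = strtok(PCRE_VERSION, ' ');

// JIT support was introduced in PCRE 8.20, so older libraries cannot have it.
$versionCouldHaveJit = version_compare($pcreVersion, '8.20', '>=');

// pcre.jit is available from PHP 7.0; false means the setting is unknown here.
$jitSetting = ini_get('pcre.jit');

printf("PCRE version: %s\n", $pcreVersion);
printf("Version new enough for JIT: %s\n", $versionCouldHaveJit ? 'yes' : 'no');
printf(
    "pcre.jit setting: %s\n",
    $jitSetting === false ? 'not available in this PHP version' : var_export($jitSetting, true)
);
```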

NikiC answered Sep 27 '22 20:09