Regular expression for a range of Unicode code points in PHP

I'm trying to strip all characters from a string except:

  • Alphanumeric characters
  • Dollar sign ($)
  • Underscore (_)
  • Unicode characters between code points U+0080 and U+FFFF

I've got the first three conditions by doing this:

preg_replace('/[^a-zA-Z\d$_]+/', '', $foo);

How do I go about matching the fourth condition? I looked at using \X but there has to be a better way than listing out 65000+ characters.

rink.attendant.6 asked Oct 20 '14 04:10



1 Answer

You can use:

$foo = preg_replace('/[^\w$\x{0080}-\x{FFFF}]+/u', '', $foo);
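  • \w - matches word characters, i.e. [a-zA-Z0-9_]; note that with the /u modifier PHP may also enable Unicode properties, in which case \w matches non-ASCII letters and digits as well
  • \x{0080}-\x{FFFF} - matches characters between code points U+0080 and U+FFFF
  • /u - enables Unicode (UTF-8) support in the regex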
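
As a quick check, here is a minimal sketch that wraps the pattern above in a function and runs it on a sample string. The function name keepAllowed and the sample input are made up for illustration, and the \u{...} string escapes assume PHP 7 or later.

```php
<?php
// Hypothetical helper (not from the original answer): strips every character
// that is not a word character, a dollar sign, or in the range U+0080..U+FFFF.
function keepAllowed(string $s): string
{
    $result = preg_replace('/[^\w$\x{0080}-\x{FFFF}]+/u', '', $s);
    if ($result === null) {
        // preg_replace returns null on failure, e.g. when the input
        // is not valid UTF-8 and the /u modifier is used.
        throw new InvalidArgumentException('preg_replace failed: ' . preg_last_error());
    }
    return $result;
}

// Sample input: ASCII punctuation, spaces, an accented letter (U+00E9)
// and an emoji above U+FFFF (U+1F600).
$input = "a-b \$c_d caf\u{00E9} \u{1F600}!";
echo keepAllowed($input), "\n"; // prints: ab$c_dcafé
```

The hyphen, spaces, exclamation mark and the emoji are removed, while the accented letter survives because U+00E9 falls inside the allowed range.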
anubhava answered Sep 17 '22 18:09