
Purpose of [^\x20-\x7E] in regular expressions

Tags:

regex

 [^\x20-\x7E] 

I saw this pattern used in a regular expression whose goal was to remove non-ASCII characters from a string. What does it mean?

Chris Burgess asked Jun 11 '09 19:06



2 Answers

It matches any single character that is not (^) in the range \x20-\x7E (hex 0x20 to 0x7E).

According to http://www.asciitable.com/, those are characters from space to ~.
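As a minimal sketch of the use case from the question (stripping everything outside that range), here is the pattern applied in Python. The function name strip_non_printable_ascii is made up for illustration and is not from the original post.

```python
import re

# Matches any single character outside space (0x20) .. tilde (0x7E).
NON_PRINTABLE_ASCII = re.compile(r'[^\x20-\x7E]')

def strip_non_printable_ascii(text: str) -> str:
    """Remove every character that is not printable ASCII (hypothetical helper)."""
    return NON_PRINTABLE_ASCII.sub('', text)

# The accented e, the tab and the bell character are all removed.
print(strip_non_printable_ascii('caf\u00e9\tbar\x07!'))  # cafbar!
```

Note that the accented letter is dropped too, not just the control characters.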

Flavius Stef answered Sep 26 '22 14:09


It means: match any character that is not a printing (printable ASCII) character.

Printing characters include a to z, A to Z, 0 to 9 and symbols such as ",;$#% etc.

^     not
\x20  hex code for the space character
-     to
\x7E  hex code for the ~ (tilde) character

All the ASCII printing characters fall between these two.

This pattern matches non-ASCII characters as well as ASCII control (non-printing) characters such as bell, tab, null and others.
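To make the two groups concrete, here is a small Python sketch that runs the pattern over a sample string and labels each hit. The describe helper and the sample string are assumptions made up for this example.

```python
import re

pattern = re.compile(r'[^\x20-\x7E]')

def describe(ch: str) -> str:
    """Label a character the pattern matched (hypothetical helper)."""
    # Anything matched below 0x80 must be 0x00-0x1F or 0x7F, i.e. a control character.
    return 'non-ASCII' if ord(ch) > 0x7F else 'ASCII control'

# Sample containing bell, tab, null and a u-umlaut among printable characters.
sample = 'a\x07b\tc\x00d\u00fce~ '
hits = [(ch, describe(ch)) for ch in pattern.findall(sample)]

for ch, kind in hits:
    print(f'U+{ord(ch):04X} {kind}')
```

The letters, the tilde and the trailing space are left alone because they sit inside the range.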

Look at

man ascii 

on a unix system to see which characters it matches.
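If no man page is at hand, roughly the same overview can be produced with a short Python sketch that walks the 7-bit ASCII table and splits it by the pattern. The variable names are illustrative only.

```python
import re

pattern = re.compile(r'[^\x20-\x7E]')

# Walk the 7-bit ASCII table (code points 0..127) and split it by the pattern.
matched = [cp for cp in range(128) if pattern.fullmatch(chr(cp))]
unmatched = [cp for cp in range(128) if not pattern.fullmatch(chr(cp))]

print(len(unmatched), 'not matched:', ''.join(chr(cp) for cp in unmatched))
print(len(matched), 'matched:', ' '.join(f'{cp:#04x}' for cp in matched))
```

Within 7-bit ASCII the matched set is 0x00 to 0x1F plus 0x7F (DEL); everything from space to tilde is left unmatched.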

In Perl, you could also write this as

[^ -~] 

or

[[:^cntrl:]] 

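This last one is not equivalent: it matches any non-control character, so it matches the printable ASCII characters themselves as well as extended ASCII (e.g. accented characters) and Unicode. To match only the control characters, the un-negated class [[:cntrl:]] is the one to use.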
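The claim that the literal form [^ -~] is the same pattern as [^\x20-\x7E] can be checked mechanically. The sketch below does so in Python rather than Perl; Python's re module does not support POSIX bracket classes, so only the first alternative is compared, and the 0x300 upper bound is an arbitrary choice for the example.

```python
import re

hex_form = re.compile(r'[^\x20-\x7E]')
literal_form = re.compile(r'[^ -~]')  # space to tilde, written literally

# Compare both patterns on every code point below U+0300.
same = all(
    bool(hex_form.fullmatch(chr(cp))) == bool(literal_form.fullmatch(chr(cp)))
    for cp in range(0x300)
)
print(same)  # True
```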

You may not want to restrict yourself to just ASCII, since non-US locations often use valid printing characters outside this small range, e.g. øüéåç...
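One possible way to keep such characters while still dropping control characters is to filter on the Unicode general category instead of the ASCII range. This is a sketch under that assumption, not something from the original answer; strip_control_chars is a hypothetical name.

```python
import re
import unicodedata

def strip_control_chars(text: str) -> str:
    """Drop characters in Unicode category Cc (control); keep everything else."""
    return ''.join(ch for ch in text if unicodedata.category(ch) != 'Cc')

sample = 'sm\u00f8rg\u00e5s\x07bord\t'

# Unicode-aware filter: bell and tab go, the accented letters stay.
unicode_aware = strip_control_chars(sample)
# ASCII-range pattern: the accented letters are removed as well.
ascii_only = re.sub(r'[^\x20-\x7E]', '', sample)

print(unicode_aware)
print(ascii_only)
```

Whether tab and newline should count as removable is a separate decision; this sketch removes them because they are in category Cc.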

Alex Brown answered Sep 26 '22 14:09