
Understand this RegEx statement

I'm trying to understand this RegEx statement in detail. It's supposed to validate the filename from an ASP.NET FileUpload control so that only JPEG and GIF files are allowed. It was designed by somebody else and I do not completely understand it. It works fine in Internet Explorer 7.0 but not in Firefox 3.6.

<asp:RegularExpressionValidator id="FileUpLoadValidator" runat="server" 
     ErrorMessage="Upload Jpegs and Gifs only." 
     ValidationExpression="^(([a-zA-Z]:)|(\\{2}\w+)\$?)(\\(\w[\w].*))(.jpg|.JPG|.gif|.GIF)$"
     ControlToValidate="LogoFileUpload">
</asp:RegularExpressionValidator>

myforums asked Feb 18 '10 15:02


2 Answers

Here's a short explanation:

^               # match the beginning of the input
(               # start capture group 1
  (             #   start capture group 2
    [a-zA-Z]    #     match any character from the set {'A'..'Z', 'a'..'z'}
    :           #     match the character ':'
  )             #   end capture group 2
  |             #   OR
  (             #   start capture group 3
    \\{2}       #     match the character '\' and repeat it exactly 2 times
    \w+         #     match a word character: [a-zA-Z_0-9] and repeat it one or more times
  )             #   end capture group 3
  \$?           #   match the character '$' and match it once or none at all
)               # end capture group 1
(               # start capture group 4
  \\            #   match the character '\'
  (             #   start capture group 5
    \w          #     match a word character: [a-zA-Z_0-9] 
    [\w]        #     match any character from the set {'0'..'9', 'A'..'Z', '_', 'a'..'z'}
    .*          #     match any character except line breaks and repeat it zero or more times
  )             #   end capture group 5
)               # end capture group 4
(               # start capture group 6
  .             #   match any character except line breaks
  jpg           #   match the characters 'jpg'
  |             #   OR
  .             #   match any character except line breaks
  JPG           #   match the characters 'JPG'
  |             #   OR
  .             #   match any character except line breaks
  gif           #   match the characters 'gif'
  |             #   OR
  .             #   match any character except line breaks
  GIF           #   match the characters 'GIF'
)               # end capture group 6
$               # match the end of the input

EDIT

As some of the comments request: the above was generated by a little tool I wrote. You can download it here: http://www.big-o.nl/apps/pcreparser/pcre/PCREParser.html (WARNING: heavily under development!)

EDIT 2

It will match strings like these:

x:\abc\def\ghi.JPG
c:\foo\bar.gif
\\foo$\baz.jpg

Here's what the groups 1, 4 and 6 match individually:

group 1 | group 4      | group 6
--------+--------------+--------
        |              |
 x:     | \abc\def\ghi | .JPG
        |              |
 c:     | \foo\bar     | .gif
        |              |
 \\foo$ | \baz         | .jpg
        |              |

Note that it also matches a string like c:\foo\bar@gif, since the unescaped dot matches any character (except line breaks). It also rejects a string like c:\foo\bar.Gif (capital G in gif), because only the all-lowercase and all-uppercase spellings are listed.
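The behaviour described above can be checked with a small script. This is a minimal sketch in Python, assuming that Python's `re` module treats this particular pattern the same way as the engines behind the validator (the pattern only uses constructs common to them). The names `PATTERN` and `matches` are made up for illustration.

```python
import re

# The validation expression from the question, unchanged.
PATTERN = re.compile(
    r"^(([a-zA-Z]:)|(\\{2}\w+)\$?)(\\(\w[\w].*))(.jpg|.JPG|.gif|.GIF)$"
)

def matches(path: str) -> bool:
    """Return True if the original pattern accepts the path."""
    return PATTERN.match(path) is not None

samples = [
    r"x:\abc\def\ghi.JPG",
    r"c:\foo\bar.gif",
    r"\\foo$\baz.jpg",
    r"c:\foo\bar@gif",   # accepted: the unescaped dot matches '@'
    r"c:\foo\bar.Gif",   # rejected: mixed case is not listed
]
for sample in samples:
    print(sample, matches(sample))

# Groups 1, 4 and 6 for the UNC example, as in the table above.
m = PATTERN.match(r"\\foo$\baz.jpg")
print(m.group(1), m.group(4), m.group(6))
```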

Bart Kiers answered Oct 23 '22 00:10


This is a bad regex.

^(([a-zA-Z]:)|(\\{2}\w+)\$?)(\\(\w[\w].*))(.jpg|.JPG|.gif|.GIF)$

Let's do it part by part.

([a-zA-Z]:)

This requires that the file path start with a drive letter like C:, d:, etc.

(\\{2}\w+)\$?

\\{2} means the backslash repeated twice (note the \ needs to be escaped), followed by some word characters (\w+, i.e. letters, digits or underscores), and then maybe a dollar sign (\$?). This is the host part of a UNC path.

(([a-zA-Z]:)|(\\{2}\w+)\$?)

The | means "or", so the path must start with either a drive letter or a UNC host. Congratulations for kicking out non-Windows users.

(\\(\w[\w].*))

This should be the directory part of the path, but it actually matches a backslash, 2 word characters, and then anything except newlines (.*), like \ab!@#*(#$*).

The proper regex for this part should be (?:\\\w+)+
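To see what the suggested directory pattern accepts, here is a minimal Python sketch. The names `DIR_PART` and `is_dir_part` are hypothetical, and `fullmatch` is used so that the pattern is tested against the whole string rather than a prefix.

```python
import re

# Suggested replacement for the directory part: one or more "\name" segments.
DIR_PART = re.compile(r"(?:\\\w+)+")

def is_dir_part(text: str) -> bool:
    """True if the whole text consists of backslash-prefixed word segments."""
    return DIR_PART.fullmatch(text) is not None

print(is_dir_part(r"\a"))                 # single-character names are accepted
print(is_dir_part(r"\Windows\System32"))  # several segments are accepted
print(is_dir_part(r"\ab!@#*(#$*)"))       # punctuation is rejected
```

Note that \w does not cover spaces or hyphens, so a segment such as \My Pictures would still be rejected by this pattern.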

(.jpg|.JPG|.gif|.GIF)$

This means the path must end with any single character followed by jpg, JPG, gif or GIF. Note that an unescaped . is not a literal dot; it matches anything except \n, so a filename like haha.abcgif or malicious.exe\0gif will pass.

The proper regex for this part should be \.(?:jpg|JPG|gif|GIF)$
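The difference between the two extension patterns can be sketched in Python as follows. `LOOSE` and `STRICT` are illustrative names, and `search` is used because each pattern is anchored only at the end of the string.

```python
import re

# Original extension part: '.' matches any character except a line break.
LOOSE = re.compile(r"(.jpg|.JPG|.gif|.GIF)$")
# Suggested extension part: '\.' matches a literal dot only.
STRICT = re.compile(r"\.(?:jpg|JPG|gif|GIF)$")

# "\0" in the last name is a NUL character, as in the example above.
for name in ["logo.gif", "haha.abcgif", "malicious.exe\0gif"]:
    print(repr(name), bool(LOOSE.search(name)), bool(STRICT.search(name)))
```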

Together,

^(([a-zA-Z]:)|(\\{2}\w+)\$?)(\\(\w[\w].*))(.jpg|.JPG|.gif|.GIF)$

will match

D:\foo.jpg
\\remote$\dummy\..\C:\Windows\System32\Logo.gif
C:\Windows\System32\cmd.exe;--gif

and will fail

/home/user/pictures/myself.jpg
C:\a.jpg
C:\d\e.jpg

The proper regex is /\.(?:jpg|gif)$/i; in addition, check on the server side whether the uploaded file is really an image.
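As a rough illustration of both halves of that advice, here is a Python sketch. The function names are hypothetical, and the content check is only an assumption about what "really an image" could mean: it compares the first bytes of the upload with the well-known JPEG (FF D8 FF) and GIF ("GIF87a" / "GIF89a") signatures, which is a first filter and not a full validation.

```python
import re

# Python spelling of /\.(?:jpg|gif)$/i
EXTENSION = re.compile(r"\.(?:jpg|gif)$", re.IGNORECASE)

def has_allowed_extension(filename: str) -> bool:
    """Check only the extension; works for any path style or a bare file name."""
    return EXTENSION.search(filename) is not None

def looks_like_jpeg_or_gif(data: bytes) -> bool:
    """Hypothetical server-side check on the leading bytes of the upload."""
    return data.startswith((b"\xff\xd8\xff", b"GIF87a", b"GIF89a"))

print(has_allowed_extension("/home/user/pictures/myself.jpg"))
print(has_allowed_extension(r"C:\foo\bar.Gif"))
print(has_allowed_extension(r"C:\Windows\System32\cmd.exe;--gif"))
print(looks_like_jpeg_or_gif(b"GIF89a" + b"\x00" * 10))
```

Because the pattern no longer assumes a drive letter or UNC prefix, it gives the same answer whether it is handed a full Windows path, a Unix path, or a bare file name.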

kennytm answered Oct 22 '22 23:10