
Is something wrong with my regex?

Tags: regex, xml

I made an XML Schema and I have this in it.

<xs:element name="Email">
  <xs:simpleType>
    <xs:restriction base="xs:string">
      <xs:pattern value="\w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*"/>
    </xs:restriction>
  </xs:simpleType>
</xs:element>

Some of the emails in one of my XML documents fail validation, and I get this error:

Email' element is invalid - The value '[email protected]' is invalid according to its datatype 'String' - The Pattern constraint failed. LineNumber: 15404 LinePostion: 32

Comparing the emails that passed with the ones that failed, I noticed that every failing address contains an underscore ("_"). I am not sure whether that is the cause.

Edit

So I changed my regex to this

 <xs:pattern value="[\w_]+([-+.'][\w_]+)*@[\w_]+([-.][\w_]+)*\.[\w_]+([-.][\w_]+)*"/>

It now works, but I don't understand why \w does not match the underscore.

chobo2 asked Nov 06 '25 17:11


1 Answer

The W3C Recommendation on datatypes defines \w as:

[#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}] (all characters except the set of "punctuation", "separator" and "other" characters)

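In Unicode, the underscore is 'LOW LINE' (U+005F), general category: punctuation, connector [Pc].

Because \p{P} covers every punctuation category, including Pc, the XML Schema \w excludes the underscore. XML Schema follows the Unicode definitions for character classes here, unlike the Perl-style regex flavors where \w includes "_".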
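The category claim above is easy to check outside a schema validator. The following is a minimal Python sketch, not an XSD regex engine: `xsd_word_char` is a hypothetical helper that approximates the W3C definition by rejecting any character whose Unicode general category starts with P, Z or C.

```python
import re
import unicodedata


def xsd_word_char(ch: str) -> bool:
    """Approximate XSD \\w: any character whose general category
    is not punctuation (P*), separator (Z*) or other (C*)."""
    return unicodedata.category(ch)[0] not in ("P", "Z", "C")


# The underscore is classified as connector punctuation.
print(unicodedata.name("_"), unicodedata.category("_"))  # LOW LINE Pc

# Under the XSD-style definition it is therefore not a word character ...
print(xsd_word_char("_"))  # False
print(xsd_word_char("a"))  # True

# ... while Python's own \w (like most Perl-style flavors) accepts it.
print(re.fullmatch(r"\w", "_") is not None)  # True
```

This is why the same pattern can accept an underscore in one regex flavor and reject it in an XML Schema validator.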

For an e-mail regex, though, you should stick to strict ASCII, such as [0-9A-Za-z_-] instead of \w (I bet an e-mail address with non-Latin characters is invalid :) ). Better still, find a proven regex, or check the RFC for the proper e-mail format.
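As a rough illustration of the ASCII-only suggestion, here is a sketch that keeps the shape of the pattern from the question and swaps \w for an explicit ASCII class. It uses Python's `re` module as a stand-in for the schema validator, which is an assumption: the two flavors differ in details. The names `WORD`, `EMAIL` and `is_valid` and the sample addresses are made up for the example, and the class omits the hyphen because the original pattern already treats it as a separator.

```python
import re

# ASCII-only stand-in for \w, with the underscore listed explicitly.
WORD = "[0-9A-Za-z_]"

# Same shape as the pattern in the question, with \w swapped for WORD.
EMAIL = re.compile(
    rf"{WORD}+([-+.']{WORD}+)*@{WORD}+([-.]{WORD}+)*\.{WORD}+([-.]{WORD}+)*"
)


def is_valid(address: str) -> bool:
    # XSD patterns must match the whole value, so use fullmatch here.
    return EMAIL.fullmatch(address) is not None


print(is_valid("john_doe@example.com"))    # True: underscore is in the class
print(is_valid("josé@example.com"))        # False: non-ASCII letter rejected
print(is_valid("no-at-sign.example.com"))  # False: no "@" present
```

Inside an XML Schema, the same class can be written directly in the `xs:pattern` value, since bracketed ASCII ranges mean the same thing there.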

mykhal answered Nov 09 '25 12:11