I need a regex or a function in PHP that will validate a string to be a good XML element name.
From w3schools:
XML elements must follow these naming rules:
- Names can contain letters, numbers, and other characters
- Names cannot start with a number or punctuation character
- Names cannot start with the letters xml (or XML, or Xml, etc)
- Names cannot contain spaces
I can write a basic regex that checks rules 1, 2 and 4, but it won't account for all the punctuation allowed and won't account for the 3rd rule:
\w[\w0-9-]
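To illustrate the gaps, here is a quick sketch that runs the pattern as written against a few sample strings. The delimiters and the sample names are made up for illustration; each sample breaks one of the rules listed above, yet all of them match, because the pattern is unanchored and \w also covers digits.

```php
<?php
// The pattern from the question, wrapped in delimiters as written (no anchors, no quantifier).
$naive = '/\w[\w0-9-]/';

// Made-up sample names: leading digit, embedded space, "xml" prefix.
$samples = ['1abc', 'ab cd', 'xmlfoo'];

foreach ($samples as $sample) {
    // preg_match() returns 1 when the pattern matches anywhere in the subject.
    $result = preg_match($naive, $sample) === 1 ? 'match' : 'no match';
    printf("%-8s => %s\n", $sample, $result);
}
```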
Here is the more authoritative source for well-formed XML Element names:
Names and Tokens
NameStartChar ::=
":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] |
[#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] |
[#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] |
[#x10000-#xEFFFF]
NameChar ::=
NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
Name ::=
NameStartChar (NameChar)*
Also a separate non-tokenized rule is specified:
Names beginning with the string "xml", or with any string which would match (('X'|'x') ('M'|'m') ('L'|'l')), are reserved for standardization in this or future versions of this specification.
XML Naming Rules: Element names must start with a letter or underscore. Element names cannot start with the letters xml (or XML, or Xml, etc). Element names can contain letters, digits, hyphens, underscores, and periods. Element names cannot contain spaces.
Underscores are commonly used in variable names in many programming languages and they can be useful in XML as well. Because you can't have spaces in element names, the underscore is commonly used in place of a space—for example, <first_name>.
If you want to create valid XML, use the DOM extension. This way you don't have to bother with any regex. If you pass an invalid name to a DOMElement, you'll get an error.
function isValidXmlName($name)
{
    try {
        new DOMElement($name);
        return TRUE;
    } catch (DOMException $e) {
        return FALSE;
    }
}
This will give
var_dump( isValidXmlName('foo') ); // true valid localName
var_dump( isValidXmlName(':foo') ); // true valid localName
var_dump( isValidXmlName(':b:c') ); // true valid localName
var_dump( isValidXmlName('b:c') ); // false assumes QName
and is likely good enough for what you want to do.
Note the distinction between localName and QName. ext/dom assumes you are using a namespaced element if there is a prefix before the colon, which adds constraints to how the name may be formed. Technically, b:b is a valid local name though because NameStartChar is part of NameChar. If you want to include these, change the function to
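function isValidXmlName($name)
{
    try {
        new DOMElement(
            $name,
            null,
            // Pass a dummy namespace URI only when there is a prefix before the colon;
            // an empty string (not null) means "no namespace" for this string parameter.
            strpos($name, ':') >= 1 ? 'http://example.com' : ''
        );
        return TRUE;
    } catch (DOMException $e) {
        return FALSE;
    }
}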
Note that element names may start with "xml". W3Schools (which is not affiliated with the W3C) apparently got this part wrong (it wouldn't be the first time). If you really want to exclude names starting with xml, add
if(stripos($name, 'xml') === 0) return false;
before the try/catch.
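Putting the pieces together, a minimal sketch of the check with the optional reserved-prefix test placed in front of the try/catch could look like this. The function name isValidNonReservedXmlName is made up for illustration, and the sketch assumes ext/dom is loaded.

```php
<?php
// Hypothetical helper name; combines the stripos() guard with the DOMElement check.
function isValidNonReservedXmlName(string $name): bool
{
    // Reject names starting with "xml" in any letter case (optional, see note above).
    if (stripos($name, 'xml') === 0) {
        return false;
    }

    try {
        // DOMElement's constructor throws a DOMException for an invalid name.
        new DOMElement($name);
        return true;
    } catch (DOMException $e) {
        return false;
    }
}

var_dump(isValidNonReservedXmlName('foo'));     // bool(true)
var_dump(isValidNonReservedXmlName('XmlFoo'));  // bool(false)
var_dump(isValidNonReservedXmlName('1foo'));    // bool(false)
```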
This has been missed so far even though the question is this old: name validation via PHP's PCRE functions, aligned with the XML specification.
XML's definition of the element name is pretty clear in its specs (Extensible Markup Language (XML) 1.0 (Fifth Edition)):
[4] NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
[4a] NameChar ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
[5] Name ::= NameStartChar (NameChar)*
This notation can be transposed into a UTF-8 compatible regular expression to be used with preg_match, here as a single-quoted PHP string to be copied verbatim:
'~^[:A-Z_a-z\\xC0-\\xD6\\xD8-\\xF6\\xF8-\\x{2FF}\\x{370}-\\x{37D}\\x{37F}-\\x{1FFF}\\x{200C}-\\x{200D}\\x{2070}-\\x{218F}\\x{2C00}-\\x{2FEF}\\x{3001}-\\x{D7FF}\\x{F900}-\\x{FDCF}\\x{FDF0}-\\x{FFFD}\\x{10000}-\\x{EFFFF}][:A-Z_a-z\\xC0-\\xD6\\xD8-\\xF6\\xF8-\\x{2FF}\\x{370}-\\x{37D}\\x{37F}-\\x{1FFF}\\x{200C}-\\x{200D}\\x{2070}-\\x{218F}\\x{2C00}-\\x{2FEF}\\x{3001}-\\x{D7FF}\\x{F900}-\\x{FDCF}\\x{FDF0}-\\x{FFFD}\\x{10000}-\\x{EFFFF}.\\-0-9\\xB7\\x{0300}-\\x{036F}\\x{203F}-\\x{2040}]*$~u'
Or as another variant with named subpatterns in a more readable fashion:
'~
# XML 1.0 Name symbol PHP PCRE regex <http://www.w3.org/TR/REC-xml/#NT-Name>
(?(DEFINE)
(?<NameStartChar> [:A-Z_a-z\\xC0-\\xD6\\xD8-\\xF6\\xF8-\\x{2FF}\\x{370}-\\x{37D}\\x{37F}-\\x{1FFF}\\x{200C}-\\x{200D}\\x{2070}-\\x{218F}\\x{2C00}-\\x{2FEF}\\x{3001}-\\x{D7FF}\\x{F900}-\\x{FDCF}\\x{FDF0}-\\x{FFFD}\\x{10000}-\\x{EFFFF}])
(?<NameChar> (?&NameStartChar) | [.\\-0-9\\xB7\\x{0300}-\\x{036F}\\x{203F}-\\x{2040}])
(?<Name> (?&NameStartChar) (?&NameChar)*)
)
^(?&Name)$
~ux'
Note that this pattern contains the colon (:), which you might want to exclude (two appearances in the first pattern, one in the second) for XML Namespace validation reasons (e.g. a test for NCName).
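As a rough sketch of such an NCName test, the character classes can be built once without the colon and reused. The function name isValidNcName and the helper variables are made up for illustration; the ranges are the ones from the grammar above. This variant anchors with \A and \z instead of ^ and $, so that a trailing newline is not accepted either.

```php
<?php
// Hypothetical helper: XML 1.0 Name check with ":" removed from both character classes.
function isValidNcName(string $name): bool
{
    // NameStartChar ranges without the colon.
    $start = 'A-Z_a-z\\xC0-\\xD6\\xD8-\\xF6\\xF8-\\x{2FF}\\x{370}-\\x{37D}'
           . '\\x{37F}-\\x{1FFF}\\x{200C}-\\x{200D}\\x{2070}-\\x{218F}'
           . '\\x{2C00}-\\x{2FEF}\\x{3001}-\\x{D7FF}\\x{F900}-\\x{FDCF}'
           . '\\x{FDF0}-\\x{FFFD}\\x{10000}-\\x{EFFFF}';

    // NameChar adds ".", "-", digits, #xB7 and two more ranges.
    $rest = $start . '.\\-0-9\\xB7\\x{0300}-\\x{036F}\\x{203F}-\\x{2040}';

    // The u modifier makes PCRE treat pattern and subject as UTF-8.
    // preg_match() returns false on invalid UTF-8, which also yields false here.
    return 1 === preg_match('~\A[' . $start . '][' . $rest . ']*\z~u', $name);
}

var_dump(isValidNcName('first_name')); // bool(true)
var_dump(isValidNcName('foo:bar'));    // bool(false)
```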
Usage Example:
$name = '::...';
$pattern = '~
# XML 1.0 Name symbol PHP PCRE regex <http://www.w3.org/TR/REC-xml/#NT-Name>
(?(DEFINE)
(?<NameStartChar> [:A-Z_a-z\\xC0-\\xD6\\xD8-\\xF6\\xF8-\\x{2FF}\\x{370}-\\x{37D}\\x{37F}-\\x{1FFF}\\x{200C}-\\x{200D}\\x{2070}-\\x{218F}\\x{2C00}-\\x{2FEF}\\x{3001}-\\x{D7FF}\\x{F900}-\\x{FDCF}\\x{FDF0}-\\x{FFFD}\\x{10000}-\\x{EFFFF}])
(?<NameChar> (?&NameStartChar) | [.\\-0-9\\xB7\\x{0300}-\\x{036F}\\x{203F}-\\x{2040}])
(?<Name> (?&NameStartChar) (?&NameChar)*)
)
^(?&Name)$
~ux';
$valid = 1 === preg_match($pattern, $name); # bool(true)
The claim that an element name cannot start with XML (in lower or uppercase letters) is not correct. <XML/> is perfectly well-formed XML and XML is a perfectly well-formed element name.
It is just that such names are in the subset of well-formed element names that are reserved for standardization (XML version 1.0 and above). It is easy to test if a (well-formed) element name is reserved with a string comparison:
$reserved = $valid && 0 === stripos($name, 'xml');
or alternatively another regular expression:
$reserved = $valid && 1 === preg_match('~^[Xx][Mm][Ll]~', $name);
PHP's DOMDocument cannot test for reserved names; at least I don't know of any way to do that, and I've looked a lot.
A valid element name needs a Unique Element Type Declaration, which seems to be out of the scope of the question here, as no such declaration has been provided. Therefore the answer does not take care of that. If there were an element type declaration, you would only need to validate against a white-list of all (case-sensitive) names, so this would be a simple case-sensitive string comparison.
Excursion: What does DOMDocument do differently from the regular expression?
Compared with DOMDocument / DOMElement, there are some differences in what qualifies as a valid element name. The DOM extension is in a kind of mixed mode, which makes it less predictable what it validates. The following excursion illustrates the behavior and shows how to control it.
Let's take $name and instantiate an element:
$element = new DOMElement($name);
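The outcome depends:
- If the first character is a colon, it validates against the XML 1.0 Name symbol.
- If the first character is not a colon, it validates against the XML Namespaces QName symbol.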
So the first character decides about the comparison mode.
A regular expression is written specifically for what it should check, here the XML 1.0 Name symbol.
You can achieve the same with DOMElement by prefixing the name with a colon:
function isValidXmlName($name)
{
    try {
        new DOMElement(":$name");
        return TRUE;
    } catch (DOMException $e) {
        return FALSE;
    }
}
To explicitly check for the QName, turn the name into a PrefixedName in case it is an UnprefixedName:
function isValidXmlnsQname($qname)
{
    $prefixedName = (!strpos($qname, ':') ? 'prefix:' : '') . $qname;
    try {
        new DOMElement($prefixedName, NULL, 'uri:ns');
        return TRUE;
    } catch (DOMException $e) {
        return FALSE;
    }
}
How about
/\A(?!XML)[a-z][\w0-9-]*\z/i
Usage:
if (preg_match('/\A(?!XML)[a-z][\w0-9-]*\z/i', $subject)) {
    # valid name
} else {
    # invalid name
}
Explanation:
\A Beginning of the string
(?!XML) Negative lookahead (assert that it is impossible to match "XML")
[a-z] Match a letter
[\w0-9-]* Match an arbitrary number of allowed characters
\z End of the string (so that trailing disallowed characters such as spaces are rejected)
/i Make the whole thing case-insensitive
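A minimal sketch of this simplified, ASCII-only check as a reusable function, anchored at both ends with \A and \z so that trailing characters are checked too. The function name isSimpleXmlName is made up for illustration. Note that this is stricter than the XML grammar quoted earlier: it rejects names that start with an underscore or contain a period or non-ASCII letters, and it treats the reserved xml prefix as invalid.

```php
<?php
// Hypothetical helper: simplified ASCII-only name check, not the full XML 1.0 Name grammar.
function isSimpleXmlName(string $name): bool
{
    // \A ... \z anchor the whole subject; (?!xml) rejects the reserved prefix;
    // the i modifier makes both the lookahead and [a-z] case-insensitive.
    return 1 === preg_match('/\A(?!xml)[a-z][\w-]*\z/i', $name);
}

var_dump(isSimpleXmlName('first_name')); // bool(true)
var_dump(isSimpleXmlName('xmlData'));    // bool(false)
var_dump(isSimpleXmlName('foo bar'));    // bool(false)
```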