How can I detect a missing space between attributes? Example:
<div style="margin:37px;"/></div>
<span title=''style="margin:37px;" /></span>
<span title="" style="margin:37px;" /></span>
<a title="u" hghghgh title="j" >
<a title=""gg ff>
correct: 1,3,4
incorrect: 2,5
How can I detect the incorrect ones?
I've tried with this:
<(.*?=(['"]).*?\2)([\S].*)|(^/)>
But it's not working.
You should not use regex to parse HTML, except for learning purposes.
<\w+(\s+[\w-]+(=(['"]?)[^"']*\3)?)*\s*/?>
This regular expression matches even if you don't have any attribute at all. It works for self-closing tags, and if the attribute has no value.
<\w+ matches the opening < and \w characters (the tag name).
(\s+[\w-]+(=(['"]?)[^"']*\3)?)* matches zero or more attributes, each of which must start with white space. It contains two parts:
- \s+[\w-]+ is the attribute name after mandatory white space
- (=(['"]?)[^"']*\3)? is the optional attribute value
\s*/?> matches optional white space and an optional /, followed by the closing >.
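As a rough illustration of the breakdown above, here is a minimal Python sketch of the same pattern written with re.VERBOSE so that each part carries its own comment. The names TAG, samples, and opening_tag_is_valid are made up for this example, and the sketch assumes Python's re module handles this pattern the same way the JavaScript engine does for these inputs.

```python
import re

# The answer's pattern, spelled out part by part (hypothetical name TAG).
TAG = re.compile(r"""
    <\w+                    # opening "<" and the tag name
    (                       # one attribute, repeated zero or more times:
        \s+[\w-]+           #   mandatory whitespace, then the attribute name
        (=(['"]?)[^"']*\3)? #   optional value; \3 must repeat the opening quote
    )*
    \s*/?>                  # optional whitespace, optional "/", closing ">"
""", re.VERBOSE)

# The five strings from the question.
samples = [
    '<div style="margin:37px;"/></div>',
    "<span title=''style=\"margin:37px;\" /></span>",
    '<span title="" style="margin:37px;" /></span>',
    '<a title="u" hghghgh title="j" >',
    '<a title=""gg ff>',
]

def opening_tag_is_valid(line):
    # match() anchors at the start of the string, so only the first tag is tested.
    return TAG.match(line) is not None

results = [opening_tag_is_valid(s) for s in samples]
print(results)  # [True, False, True, True, False]
```

The second and fifth strings fail because the pattern demands whitespace before every attribute name, which is exactly the condition the question asks about.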
Here is a test for the strings:
var re = /<\w+(\s+[\w-]+(=(['"]?)[^"']*\3)?)*\s*\/?>/g;
! '<div style="margin:37px;"/></div>'.match(re);
false
! '<span title=\'\'style="margin:37px;" /></span>'.match(re);
true
! '<span title="" style="margin:37px;" /></span>'.match(re);
false
! '<a title="u" hghghgh title="j" >'.match(re);
false
! '<a title=""gg ff>'.match(re);
true
var html = '<div style="margin:37px;"></div> <span title=\'\'style="margin:37px;"/><a title=""gg ff/> <span title="" style="margin:37px;" /></span> <a title="u" hghghgh title="j"example> <a title=""gg ff>';
var tagRegex = /<\w+[^>]*\/?>/g;
var validRegex = /<\w+(\s+[\w-]+(=(['"]?)[^"']*\3)?)*\s*\/?>/g;
html.match(tagRegex).forEach(function(m) {
  if (!m.match(validRegex)) {
    console.log('Incorrect', m);
  }
});
This will output:
Incorrect <span title=''style="margin:37px;"/>
Incorrect <a title=""gg ff/>
Incorrect <a title="u" hghghgh title="j"example>
Incorrect <a title=""gg ff>
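The two-step idea above (first grab every opening tag, then validate each one) can be sketched in Python as well. This is only an approximation of the JavaScript snippet: the names TAG, VALID, and malformed_tags are invented here, and fullmatch() is used instead of an unanchored match, which is a slightly stricter check than the original performs.

```python
import re

# Step 1: anything that looks like an opening tag, well-formed or not.
TAG = re.compile(r"<\w+[^>]*/?>")
# Step 2: the validating pattern from the answer.
VALID = re.compile(r"""<\w+(\s+[\w-]+(=(['"]?)[^"']*\3)?)*\s*/?>""")

def malformed_tags(html):
    # fullmatch() requires the pattern to cover the whole candidate tag.
    return [tag for tag in TAG.findall(html) if not VALID.fullmatch(tag)]

# The same test document as in the JavaScript example.
html = (
    '<div style="margin:37px;"></div> '
    "<span title=''style=\"margin:37px;\"/>"
    '<a title=""gg ff/> '
    '<span title="" style="margin:37px;" /></span> '
    '<a title="u" hghghgh title="j"example> '
    '<a title=""gg ff>'
)

bad = malformed_tags(html)
for tag in bad:
    print("Incorrect", tag)
```

On this input the sketch should report the same four tags as the JavaScript version.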
A variant that handles double-quoted, single-quoted, and unquoted values separately:
<\w+(\s+[\w-]+(="[^"]*"|='[^']*'|=[\w-]+)?)*\s*/?>
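One visible difference between this last pattern and the earlier one is how they treat a quote character inside a quoted value. The Python sketch below compares the two on a value containing an apostrophe; LOOSE and STRICT are hypothetical names chosen for this example, and fullmatch() is an assumption about how the patterns would be applied.

```python
import re

# Earlier pattern: the value may contain neither quote character.
LOOSE = re.compile(r"""<\w+(\s+[\w-]+(=(['"]?)[^"']*\3)?)*\s*/?>""")
# Later pattern: each quoting style only excludes its own delimiter.
STRICT = re.compile(r"""<\w+(\s+[\w-]+(="[^"]*"|='[^']*'|=[\w-]+)?)*\s*/?>""")

apostrophe_tag = '''<a title="it's here">'''
missing_space_tag = '<a title=""gg ff>'

loose_ok = LOOSE.fullmatch(apostrophe_tag) is not None
strict_ok = STRICT.fullmatch(apostrophe_tag) is not None
strict_flags_missing_space = STRICT.fullmatch(missing_space_tag) is None

print(loose_ok, strict_ok, strict_flags_missing_space)  # False True True
```

So the later pattern accepts an apostrophe inside a double-quoted value while still rejecting the missing-space case.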
I got this pattern to work, finding incorrect lines 2 and 5 as you requested:
>>> import re
>>> p = r'<[a-z]+\s[a-z]+=[\'\"][\w;:]*[\"\'][\w]+.*'
>>> html = """
<div style="margin:37px;"/></div>
<span title=''style="margin:37px;" /></span>
<span title="" style="margin:37px;" /></span>
<a title="u" hghghgh title="j" >
<a title=""gg ff>
"""
>>> bad = re.findall(p, html)
>>> print('\n'.join(bad))
<span title=''style="margin:37px;" /></span>
<a title=""gg ff>
The regex broken down:
p = r'<[a-z]+\s[a-z]+=[\'\"][\w;:]*[\"\'][\w]+.*'
< - starting bracket
[a-z]+\s - 1 or more lowercase letters followed by a whitespace character
[a-z]+= - 1 or more lowercase letters followed by an equals sign
[\'\"] - match a single or double quote one time
[\w;:]* - match an alphanumeric character (a-zA-Z0-9_), a colon, or a semicolon 0 or more times
[\"\'] - again match a single or double quote one time
[\w]+ - match an alphanumeric character one or more times (this catches the lack of a space you wanted to detect)
.* - match anything 0 or more times (gets the rest of the line)
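For reference, the same pattern can be run under Python 3 roughly as follows. The names PATTERN, lines, and later_attr are made up for this sketch. It also checks one extra input, taken from the other answer's test string, which suggests that this pattern only inspects the first attribute after the tag name.

```python
import re

# Same pattern as above, written in a triple-quoted raw string to avoid escapes.
PATTERN = re.compile(r'''<[a-z]+\s[a-z]+=['"][\w;:]*['"]\w+.*''')

# The five lines from the question.
lines = [
    '<div style="margin:37px;"/></div>',
    "<span title=''style=\"margin:37px;\" /></span>",
    '<span title="" style="margin:37px;" /></span>',
    '<a title="u" hghghgh title="j" >',
    '<a title=""gg ff>',
]

bad = [line for line in lines if PATTERN.match(line)]
print("\n".join(bad))

# The missing space here follows the second title attribute, not the first,
# so the pattern has nothing to anchor on and does not match.
later_attr = '<a title="u" hghghgh title="j"example>'
print(PATTERN.match(later_attr))  # None
```

If tags with several attributes need checking, a pattern that loops over every attribute is probably the safer choice.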