 

Regex for no space between attributes html

Tags: html, regex

How can I detect a missing space between attributes? Example:

 <div style="margin:37px;"/></div>
 <span title=''style="margin:37px;" /></span>
 <span title="" style="margin:37px;" /></span>
 <a title="u" hghghgh  title="j" >

 <a title=""gg  ff>

Lines 1, 3, and 4 are correct; lines 2 and 5 are incorrect. How can I detect the incorrect ones?

I've tried with this:

<(.*?=(['"]).*?\2)([\S].*)|(^/)>

But it's not working.

wroe12 asked Dec 30 '15 18:12


2 Answers

You should not use regex to parse HTML, except as a learning exercise.


http://regexr.com/3cge1

<\w+(\s+[\w-]+(=(['"]?)[^"']*\3)?)*\s*/?>

This regular expression also matches tags that have no attributes at all, self-closing tags, and attributes that have no value.


  • <\w+ matches the opening < followed by the tag name (one or more \w characters).

  • (\s+[\w-]+(=(['"]?)[^"']*\3)?)* matches zero or more attributes, each of which must start with white space. It has two parts:

    • \s+[\w-]+ is the attribute name after the mandatory white space
    • (=(['"]?)[^"']*\3)? is the optional attribute value; \3 requires the closing quote to be the same as the opening one
  • \s*/?> matches optional white space and an optional / followed by the closing >.
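
To make the breakdown above easier to follow, here is a minimal sketch that assembles the same pattern from its parts. The variable names (tagOpen, attrName, attrValue, tagClose, validTag) are made up for illustration and are not part of the answer; only the pattern text is.

```javascript
// Names below are illustrative; only the pattern text comes from the answer.
const tagOpen = String.raw`<\w+`;                   // "<" followed by the tag name
const attrName = String.raw`\s+[\w-]+`;             // mandatory white space, then the attribute name
const attrValue = String.raw`(=(['"]?)[^"']*\3)?`;  // optional value; \3 repeats whatever group 3 captured
const tagClose = String.raw`\s*/?>`;                // optional white space, optional "/", then ">"

// Group 1 is one whole attribute, group 2 its value, group 3 the opening quote (possibly empty).
const validTag = new RegExp(tagOpen + '(' + attrName + attrValue + ')*' + tagClose);

const samples = [
  '<br>',
  '<input disabled>',
  '<div style="margin:37px;"/>',
  "<span title=''style=\"margin:37px;\" />",
];
const verdicts = samples.map(function (tag) {
  return validTag.test(tag);
});
console.log(verdicts);
```

The first three samples should be reported as valid and the last one as invalid, because style follows the closing quote without any white space.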


Here is a test against the strings from the question (each expression is true when the tag is incorrect):

var re = /<\w+(\s+[\w-]+(=(['"]?)[^"']*\3)?)*\s*\/?>/g;

! '<div style="margin:37px;"/></div>'.match(re);
false

! '<span title=\'\'style="margin:37px;" /></span>'.match(re);
true

! '<span title="" style="margin:37px;" /></span>'.match(re);
false

! '<a title="u" hghghgh  title="j" >'.match(re);
false

! '<a title=""gg  ff>'.match(re);
true

Display all incorrect tags:

var html = '<div style="margin:37px;"></div> <span title=\'\'style="margin:37px;"/><a title=""gg ff/> <span title="" style="margin:37px;" /></span> <a title="u" hghghgh title="j"example> <a title=""gg ff>';
var tagRegex = /<\w+[^>]*\/?>/g;
var validRegex = /<\w+(\s+[\w-]+(=(['"]?)[^"']*\3)?)*\s*\/?>/g;

html.match(tagRegex).forEach(function(m) {
  if(!m.match(validRegex)) {
    console.log('Incorrect', m);
  }
});

This will output:

Incorrect <span title=''style="margin:37px;"/>
Incorrect <a title=""gg ff/>
Incorrect <a title="u" hghghgh title="j"example>
Incorrect <a title=""gg ff>

Update in response to the comments (double-quoted, single-quoted, and unquoted values are now matched separately):

<\w+(\s+[\w-]+(="[^"]*"|='[^']*'|=[\w-]+)?)*\s*/?>
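
As a rough comparison of the two patterns, the sketch below runs both against a few tags. The ^ and $ anchors, the variable names, and the sample tags are additions for illustration and are not from the answer.

```javascript
// Updated pattern from the answer; the ^ and $ anchors are added here
// so that each tag is checked as a whole.
const updatedValidTag = /^<\w+(\s+[\w-]+(="[^"]*"|='[^']*'|=[\w-]+)?)*\s*\/?>$/;
// Original pattern, anchored the same way, for comparison.
const originalValidTag = /^<\w+(\s+[\w-]+(=(['"]?)[^"']*\3)?)*\s*\/?>$/;

const sampleTags = [
  '<a title="it\'s">',   // single quote inside a double-quoted value
  '<a href=index>',      // unquoted value made of word characters
  '<a href=/path>',      // unquoted value containing "/"
  '<a title=""gg ff>',   // missing space after the closing quote
];

const comparison = sampleTags.map(function (tag) {
  return {
    tag: tag,
    original: originalValidTag.test(tag),
    updated: updatedValidTag.test(tag),
  };
});
console.log(comparison);
```

The updated pattern accepts a quote of the other kind inside a quoted value, which the original rejects. The trade-off is that unquoted values are limited to word characters and hyphens, so a tag like <a href=/path> is reported as incorrect even though browsers accept it.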
sina answered Sep 18 '22 01:09


I got this pattern to work, finding incorrect lines 2 and 5 as you requested:

>>> import re
>>> p = r'<[a-z]+\s[a-z]+=[\'\"][\w;:]*[\"\'][\w]+.*'

>>> html = """
 <div style="margin:37px;"/></div>
 <span title=''style="margin:37px;" /></span>
 <span title="" style="margin:37px;" /></span>
 <a title="u" hghghgh  title="j" >

 <a title=""gg  ff>
"""

>>> bad = re.findall(p, html)
>>> print('\n'.join(bad))
<span title=''style="margin:37px;" /></span>
<a title=""gg  ff>

regex broken down:

p = r'<[a-z]+\s[a-z]+=[\'\"][\w;:]*[\"\'][\w]+.*'

< - starting bracket

[a-z]+\s - 1 or more lowercase letters followed by a single whitespace character

[a-z]+= - 1 or more lowercase letters followed by an equals sign

[\'\"] - match a single or double quote one time

[\w;:]* - match an alphanumeric character (a-zA-Z0-9_) or a colon or semicolon 0 or more times

[\"\'] - again match a single or double quote one time

[\w]+ - match an alphanumeric character one or more times (this catches the missing space you wanted to detect)

.* - match anything 0 or more times (gets the rest of the line)
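
For readers working in JavaScript, here is a sketch of the same pattern ported from Python, with the g flag standing in for re.findall. The variable names and the last sample line are additions for illustration; that extra line is not in the question.

```javascript
// JavaScript port of the Python pattern above; with the "g" flag,
// String.prototype.match returns every match, much like re.findall.
const badFirstAttr = /<[a-z]+\s[a-z]+=['"][\w;:]*["'][\w]+.*/g;

const htmlLines = [
  '<div style="margin:37px;"/></div>',
  '<span title=\'\'style="margin:37px;" /></span>',
  '<span title="" style="margin:37px;" /></span>',
  '<a title="u" hghghgh  title="j" >',
  '<a title=""gg  ff>',
  // Extra case, not in the question: the space is missing after the SECOND attribute.
  '<a title="u" id="j"example>',
].join('\n');

// match() returns null when nothing matches, so fall back to an empty array.
const bad = htmlLines.match(badFirstAttr) || [];
console.log(bad.join('\n'));
```

This should print lines 2 and 5, as in the Python session. The extra last line is not reported: the pattern only inspects the first attribute after the tag name, and it only recognises values made of word characters, colons, and semicolons, so a missing space later in the tag goes unnoticed.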

Totem answered Sep 20 '22 01:09