Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to use Regular Expression to match the charset string in HTML?

Tags:

html

regex

HTML code example:

<meta http-equiv="Content-type" content="text/html;charset=utf-8" />

I want to use RegEx to extract the charset information (i.e. here, it's "utf-8")

(I'm using C#)

like image 1000
silent Avatar asked Aug 11 '10 12:08

silent


2 Answers

My answer provides a more robust version of @Floyd's and, to the degree possible, addresses @You's breakage test case, where a negative lookahead is used to avoid it. There's really only one relevant case I can think of (a variant of @You's example) where it will give a false positive, but I think it would be pretty rare. Expressions are expected to be run with the case-insensitive flag and were tested using java.util.regex and JRegex.

Capture groups are automatically trimmed and never include quotes, nor other tag chars like "/" or ">". In the second expression, there are 2 capture groups; the first being the content-type value, which may be empty (i.e., when using charset attribtue), and the second being the charset value, which will always be non-empty (unless the charset value is literally left empty for some odd reason).

Regex for matching/grouping charset value only - trimmed, skips quotes

<meta(?!\s*(?:name|value)\s*=)[^>]*?charset\s*=[\s"']*([^\s"'/>]*)

Same as above, but also matches/groups content-type (optional) and charset (required) values, trimmed, skips quotes. Minor caveat - Misses matching standalone content type value, i.e., "text/html"

<meta(?!\s*(?:name|value)\s*=)(?:[^>]*?content\s*=[\s"']*)?([^>]*?)[\s"';]*charset\s*=[\s"']*([^\s"'/>]*)

Test cases (all pass except the very last one)...

<meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1"/>
<meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1" />
<meta http-equiv='Content-Type' content='text/html;charset=iso-8859-1'/>
<meta http-equiv='Content-Type' content='text/html;charset=iso-8859-1' />
<meta http-equiv=Content-Type content=text/html;charset=iso-8859-1/>
<meta http-equiv=Content-Type content=text/html;charset=iso-8859-1 />
<meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1">
<meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1" >
<meta http-equiv='Content-Type' content='text/html;charset=iso-8859-1'>
<meta http-equiv='Content-Type' content='text/html;charset=iso-8859-1' >
<meta http-equiv=Content-Type content=text/html;charset=iso-8859-1>
<meta http-equiv=Content-Type content=text/html;charset=iso-8859-1 >

<meta http-equiv="Content-Type" content="text/html;charset='iso-8859-1'">
<meta http-equiv="Content-Type" content="'text/html;charset=iso-8859-1'">
<meta http-equiv="Content-Type" content="'text/html';charset='iso-8859-1'">
<meta http-equiv='Content-Type' content='text/html;charset="iso-8859-1"'>
<meta http-equiv='Content-Type' content='"text/html;charset=iso-8859-1"'>
<meta http-equiv='Content-Type' content='"text/html";charset="iso-8859-1"'>

<meta http-equiv="Content-Type" content="text/html;;;charset=iso-8859-1">
<meta http-equiv="Content-Type" content="text/html;;;charset='iso-8859-1'">
<meta http-equiv="Content-Type" content="'text/html;;;charset=iso-8859-1'">
<meta http-equiv="Content-Type" content="'text/html';;;charset='iso-8859-1'">
<meta http-equiv='Content-Type' content='text/html;;;charset=iso-8859-1'>
<meta http-equiv='Content-Type' content='text/html;;;charset="iso-8859-1"'>
<meta http-equiv='Content-Type' content='"text/html;;;charset=iso-8859-1"'>
<meta http-equiv='Content-Type' content='"text/html";;;charset="iso-8859-1"'>

<meta  http-equiv  =  "  Content-Type  "  content  =  "  '  text/html  '  ;  ;;  '  ;  '  '  ;  '  ;  ' ;;  ;  charset  =  '  iso-8859-1  '  "  >
<meta  content  =  "  '  text/html  '  ;  ;;  '  ;  '  '  ;  '  ;  ' ;;  ;  charset  =  '  iso-8859-1  '  "  http-equiv  =  "  Content-Type  "  >
<meta  http-equiv  =  Content-Type  content  =  text/html;charset=iso-8859-1  >
<meta  content  =  text/html;charset=iso-8859-1  http-equiv  =  Content-Type  >
<meta  http-equiv  =  Content-Type  content  =  text/html  ;  charset  =  iso-8859-1  >
<meta  content  =  text/html  ;  charset  =  iso-8859-1  http-equiv  =  Content-Type  >
<meta  http-equiv  =  Content-Type  content  =  text/html  ;;;  charset  =  iso-8859-1  >
<meta  content  =  text/html  ;;;  charset  =  iso-8859-1  http-equiv  =  Content-Type  >
<meta  http-equiv  =  Content-Type  content  =  text/html  ;  ;  ;  charset  =  iso-8859-1  >
<meta  content  =  text/html  ;  ;  ;  charset  =  iso-8859-1  http-equiv  =  Content-Type  >

<meta charset="utf-8"/>
<meta charset="utf-8" />
<meta charset='utf-8'/>
<meta charset='utf-8' />
<meta charset=utf-8/>
<meta charset=utf-8 />
<meta charset="utf-8">
<meta charset="utf-8" >
<meta charset='utf-8'>
<meta charset='utf-8' >
<meta charset=utf-8>
<meta charset=utf-8 >

<meta  charset  =  "  utf-8  "  >
<meta  charset  =  '  utf-8  '  >
<meta  charset  =  "  utf-8  '  >
<meta  charset  =  '  utf-8  "  >
<meta  charset  =  "  utf-8     >
<meta  charset  =  '  utf-8     >
<meta  charset  =     utf-8  '  >
<meta  charset  =     utf-8  "  >
<meta  charset  =     utf-8     >
<meta  charset  =     utf-8    />

<meta name="title" value="charset=utf-8 — is it really useful (yep)?">
<meta value="charset=utf-8 — is it really useful (yep)?" name="title">
<meta name="title" content="charset=utf-8 — is it really useful (yep)?">
<meta name="charset=utf-8" content="charset=utf-8 — is it really useful (yep)?">

<meta content="charset=utf-8 — is it really useful (nope, not here, but gotta admit pretty robust otherwise)?" name="title">
like image 187
sisu Avatar answered Nov 10 '22 22:11

sisu


This regex:

<meta.*?charset=([^"']+)

Should work. Using an XML parser to extract this is overkill.

like image 37
NullUserException Avatar answered Nov 10 '22 23:11

NullUserException