HTML code example:
<meta http-equiv="Content-type" content="text/html;charset=utf-8" />
I want to use a regex to extract the charset value (here, "utf-8").
I'm using C#.
My answer provides a more robust version of @Floyd's expression and, as far as possible, handles @You's breakage test case by using a negative lookahead. I can think of only one relevant case (a variant of @You's example) where it still gives a false positive, and I expect that to be rare. The expressions are meant to be run with the case-insensitive flag and were tested with java.util.regex and JRegex.
The capture groups are trimmed and never include quotes or other tag characters such as "/" or ">". The second expression has two capture groups: the first is the content-type value, which may be empty (e.g., when the standalone charset attribute is used), and the second is the charset value, which is non-empty unless the charset value is literally left empty in the markup.
Regex for matching/grouping the charset value only (trimmed, quotes skipped):
<meta(?!\s*(?:name|value)\s*=)[^>]*?charset\s*=[\s"']*([^\s"'/>]*)
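As a rough sketch of how the expression above might be applied, here is a small Java example using java.util.regex with the case-insensitive flag. The class name `CharsetOnly` and the method `extractCharset` are made up for illustration. The same pattern syntax should carry over to C#'s `Regex` with `RegexOptions.IgnoreCase`, but that is an assumption and was not checked here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharsetOnly {
    // The charset-only expression from above, compiled case-insensitively.
    static final Pattern CHARSET = Pattern.compile(
        "<meta(?!\\s*(?:name|value)\\s*=)[^>]*?charset\\s*=[\\s\"']*([^\\s\"'/>]*)",
        Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns the captured charset, or null if no tag matches.
    static String extractCharset(String html) {
        Matcher m = CHARSET.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The tag from the question.
        System.out.println(extractCharset(
            "<meta http-equiv=\"Content-type\" content=\"text/html;charset=utf-8\" />"));
        // A padded, single-quoted HTML5-style tag; the group comes back trimmed.
        System.out.println(extractCharset("<meta charset = ' utf-8 ' >"));
    }
}
```

Both calls in `main` should print `utf-8`, since the capture group stops at whitespace, quotes, "/" and ">".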
Same as above, but also matches/groups the content-type (optional) and charset (required) values, trimmed, with quotes skipped. Minor caveat: it does not match a tag that carries only a standalone content-type value, i.e., "text/html" with no charset.
<meta(?!\s*(?:name|value)\s*=)(?:[^>]*?content\s*=[\s"']*)?([^>]*?)[\s"';]*charset\s*=[\s"']*([^\s"'/>]*)
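The two-group expression could be used along these lines. This is only a sketch: the class name `ContentTypeAndCharset` and the method `extract` are hypothetical, and returning a two-element array is just one convenient way to hand back both groups.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContentTypeAndCharset {
    // The two-group expression from above, compiled case-insensitively.
    static final Pattern META = Pattern.compile(
        "<meta(?!\\s*(?:name|value)\\s*=)(?:[^>]*?content\\s*=[\\s\"']*)?"
            + "([^>]*?)[\\s\"';]*charset\\s*=[\\s\"']*([^\\s\"'/>]*)",
        Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns {contentType, charset}, or null if no tag matches.
    static String[] extract(String html) {
        Matcher m = META.matcher(html);
        return m.find() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String[] full = extract(
            "<meta http-equiv=\"Content-type\" content=\"text/html;charset=utf-8\" />");
        System.out.println(full[0] + " | " + full[1]);

        // With the standalone charset attribute, group 1 is the empty string.
        String[] html5 = extract("<meta charset=\"utf-8\">");
        System.out.println("[" + html5[0] + "] | " + html5[1]);

        // The caveat: a content type without any charset is not matched at all.
        System.out.println(extract(
            "<meta http-equiv=\"Content-Type\" content=\"text/html\">") == null);
    }
}
```

Because group 1 is never optional in the pattern, it comes back as an empty string rather than `null` when only the charset attribute is present.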
Test cases (all pass except the very last one):
<meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1"/>
<meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1" />
<meta http-equiv='Content-Type' content='text/html;charset=iso-8859-1'/>
<meta http-equiv='Content-Type' content='text/html;charset=iso-8859-1' />
<meta http-equiv=Content-Type content=text/html;charset=iso-8859-1/>
<meta http-equiv=Content-Type content=text/html;charset=iso-8859-1 />
<meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1">
<meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1" >
<meta http-equiv='Content-Type' content='text/html;charset=iso-8859-1'>
<meta http-equiv='Content-Type' content='text/html;charset=iso-8859-1' >
<meta http-equiv=Content-Type content=text/html;charset=iso-8859-1>
<meta http-equiv=Content-Type content=text/html;charset=iso-8859-1 >
<meta http-equiv="Content-Type" content="text/html;charset='iso-8859-1'">
<meta http-equiv="Content-Type" content="'text/html;charset=iso-8859-1'">
<meta http-equiv="Content-Type" content="'text/html';charset='iso-8859-1'">
<meta http-equiv='Content-Type' content='text/html;charset="iso-8859-1"'>
<meta http-equiv='Content-Type' content='"text/html;charset=iso-8859-1"'>
<meta http-equiv='Content-Type' content='"text/html";charset="iso-8859-1"'>
<meta http-equiv="Content-Type" content="text/html;;;charset=iso-8859-1">
<meta http-equiv="Content-Type" content="text/html;;;charset='iso-8859-1'">
<meta http-equiv="Content-Type" content="'text/html;;;charset=iso-8859-1'">
<meta http-equiv="Content-Type" content="'text/html';;;charset='iso-8859-1'">
<meta http-equiv='Content-Type' content='text/html;;;charset=iso-8859-1'>
<meta http-equiv='Content-Type' content='text/html;;;charset="iso-8859-1"'>
<meta http-equiv='Content-Type' content='"text/html;;;charset=iso-8859-1"'>
<meta http-equiv='Content-Type' content='"text/html";;;charset="iso-8859-1"'>
<meta http-equiv = " Content-Type " content = " ' text/html ' ; ;; ' ; ' ' ; ' ; ' ;; ; charset = ' iso-8859-1 ' " >
<meta content = " ' text/html ' ; ;; ' ; ' ' ; ' ; ' ;; ; charset = ' iso-8859-1 ' " http-equiv = " Content-Type " >
<meta http-equiv = Content-Type content = text/html;charset=iso-8859-1 >
<meta content = text/html;charset=iso-8859-1 http-equiv = Content-Type >
<meta http-equiv = Content-Type content = text/html ; charset = iso-8859-1 >
<meta content = text/html ; charset = iso-8859-1 http-equiv = Content-Type >
<meta http-equiv = Content-Type content = text/html ;;; charset = iso-8859-1 >
<meta content = text/html ;;; charset = iso-8859-1 http-equiv = Content-Type >
<meta http-equiv = Content-Type content = text/html ; ; ; charset = iso-8859-1 >
<meta content = text/html ; ; ; charset = iso-8859-1 http-equiv = Content-Type >
<meta charset="utf-8"/>
<meta charset="utf-8" />
<meta charset='utf-8'/>
<meta charset='utf-8' />
<meta charset=utf-8/>
<meta charset=utf-8 />
<meta charset="utf-8">
<meta charset="utf-8" >
<meta charset='utf-8'>
<meta charset='utf-8' >
<meta charset=utf-8>
<meta charset=utf-8 >
<meta charset = " utf-8 " >
<meta charset = ' utf-8 ' >
<meta charset = " utf-8 ' >
<meta charset = ' utf-8 " >
<meta charset = " utf-8 >
<meta charset = ' utf-8 >
<meta charset = utf-8 ' >
<meta charset = utf-8 " >
<meta charset = utf-8 >
<meta charset = utf-8 />
<meta name="title" value="charset=utf-8 — is it really useful (yep)?">
<meta value="charset=utf-8 — is it really useful (yep)?" name="title">
<meta name="title" content="charset=utf-8 — is it really useful (yep)?">
<meta name="charset=utf-8" content="charset=utf-8 — is it really useful (yep)?">
<meta content="charset=utf-8 — is it really useful (nope, not here, but gotta admit pretty robust otherwise)?" name="title">
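To illustrate why the last test case fails, here is a small sketch that runs the charset-only expression against shortened variants of the final few tags (the attribute text is abbreviated; the class `LookaheadCheck` and method `matches` are made up for this example). The negative lookahead sits directly after `<meta`, so it only rejects tags whose first attribute is `name` or `value`.

```java
import java.util.regex.Pattern;

class LookaheadCheck {
    // The charset-only expression from the answer, compiled case-insensitively.
    static final Pattern CHARSET = Pattern.compile(
        "<meta(?!\\s*(?:name|value)\\s*=)[^>]*?charset\\s*=[\\s\"']*([^\\s\"'/>]*)",
        Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: true if the expression finds a charset in the tag.
    static boolean matches(String tag) {
        return CHARSET.matcher(tag).find();
    }

    public static void main(String[] args) {
        // First attribute is name or value: the lookahead rejects the tag.
        System.out.println(matches("<meta name=\"title\" content=\"charset=utf-8 in text\">"));  // false
        System.out.println(matches("<meta value=\"charset=utf-8 in text\" name=\"title\">"));    // false
        // First attribute is content: the lookahead does not fire, so the
        // text inside the attribute is reported as a charset (false positive).
        System.out.println(matches("<meta content=\"charset=utf-8 in text\" name=\"title\">"));  // true
    }
}
```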
This regex:
<meta.*?charset=([^"']+)
should work for the example in the question. Using an XML parser to extract this is overkill.
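For comparison, a quick sketch of this shorter expression in Java (the class `SimpleCharset` and method `extract` are hypothetical, and no regex flags are set since none are mentioned). It handles the tag from the question, but because the capture group needs at least one non-quote character right after `charset=`, it finds nothing when the value itself is quoted, as in `<meta charset="utf-8">`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SimpleCharset {
    // The short expression from this answer, with no flags set.
    static final Pattern SIMPLE = Pattern.compile("<meta.*?charset=([^\"']+)");

    // Hypothetical helper: returns the captured charset, or null if nothing matches.
    static String extract(String html) {
        Matcher m = SIMPLE.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The tag from the question: the group stops at the closing quote.
        System.out.println(extract(
            "<meta http-equiv=\"Content-type\" content=\"text/html;charset=utf-8\" />"));
        // A quoted standalone charset attribute is not matched.
        System.out.println(extract("<meta charset=\"utf-8\">"));
    }
}
```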