First of all, I know how most RegExp questions go, and this is not one of those "please write my code" questions.
My confusion lies in the fact that my RegExp works on regexr, and in Chrome's dev tools when polling document.body.textContent, but not on an HTML file after I have read it in io.js.
io.js is version 1.5.1, running on Windows 8.
Why would it work in both places listed, but not in io.js? Am I not taking something into consideration that io.js does to read files?
My RegExp should be matching "@{each ___->___} text and line breaks @{/each}" as it does in the link below, but instead, it returns null.
Here is what I'm trying to use: http://regexr.com/3aldk
RegExp:
/@\{each ([a-zA-Z0-9->.]*)\}([\s\S]*)@\{\/each}/g
JS (Example):
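var fs = require('fs');
// The RegExp from above, unchanged
var myRegExp = /@\{each ([a-zA-Z0-9->.]*)\}([\s\S]*)@\{\/each}/g;
fs.readFile('view.html', {encoding:'utf8'}, function(error, html) {
  console.log(html.match(myRegExp)); // null
});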
HTML:
<!doctype html>
<html>
<head>
<title>@{title}</title>
</head>
<body>
<h1>@{foo.bar}</h1>
<p>
Lorem ipsum dolor sit amet, @{foo.baz.hoo}
</p>
@{each people->person}
<div>
<b>@{person.name}:</b> @{person.age}
</div>
@{/each}
</body>
</html>
Am I missing something obvious, like a character that is present in the file on the server side, but not once it is served?
Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts.
While parsing arbitrary HTML with only a regex is impossible, it's sometimes appropriate to use regexes for parsing a limited, known set of HTML. If you have a small set of HTML pages that you want to scrape data from and then stuff into a database, regexes might work fine.
The \xn escape allows ASCII codes to be used in regular expressions. It matches n, where n is a hexadecimal escape value. Hexadecimal escape values must be exactly two digits long. For example, \x41 matches A.
A regular expression is a pattern of characters. The pattern is used to do pattern-matching "search-and-replace" functions on text.
The issue here lies on the fine line between specification and implementations.
The ECMAScript 5.1 specification states that:
A - character can be treated literally or it can denote a range. It is treated literally if it is the first or last character of ClassRanges, the beginning or end limit of a range specification, or immediately follows a range specification.
Regular-Expressions.info notes that:
Hyphens at other positions in character classes where they can't form a range may be interpreted as literals or as errors. Regex flavors are quite inconsistent about this.
The safe way of including a dash - (minus sign) in a character class is by either:
escaping it with a backslash (e.g. [a-zA-Z0-9\->.])
placing it first in the class (e.g. [-.>a-zA-Z0-9])
placing it right after the negating ^ (e.g. [^-.>a-zA-Z0-9])
placing it last in the class (e.g. [a-zA-Z0-9.>-])
General coding guidelines suggest placing your ranges first and ending the character class with the hyphen; this avoids ambiguity and helps readability.
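To compare the hyphen placements side by side, here is a minimal sketch. The names hyphenClasses, sample, and negated are made up for illustration, and the ^...$ anchors are added only so that each test covers the whole string.

```javascript
// Three character classes that each contain a literal hyphen.
// All names here are hypothetical, chosen for this illustration.
var hyphenClasses = {
  escaped: /^[a-zA-Z0-9\->.]+$/, // hyphen escaped with a backslash
  first: /^[-.>a-zA-Z0-9]+$/,    // hyphen as the first character
  last: /^[a-zA-Z0-9.>-]+$/      // hyphen as the last character
};

var sample = 'people->person';

Object.keys(hyphenClasses).forEach(function (name) {
  // Each class accepts letters, digits, '.', '>' and '-'.
  console.log(name, hyphenClasses[name].test(sample)); // true for all three
});

// Negated form: the hyphen sits right after the ^.
var negated = /[^-.>a-zA-Z0-9]/;
console.log(negated.test(sample));             // false: no character outside the class
console.log(negated.test('people -> person')); // true: the spaces are outside the class
```

In this sketch every variant treats the hyphen as a literal, so the choice between them comes down to readability.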
Summing it up, your RegEx should become:
/@\{each ([a-zA-Z0-9>.-]*)\}([\s\S]*)@\{\/each}/g
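As a quick check, the expression with the hyphen moved to the end can be run against a shortened copy of the markup from the question. This is only a sketch: the html string is built inline rather than read from view.html, and the \r\n line endings are an assumption about how the file is saved on Windows.

```javascript
// RegExp with the hyphen placed last in the character class.
var eachBlock = /@\{each ([a-zA-Z0-9>.-]*)\}([\s\S]*)@\{\/each}/g;

// Shortened stand-in for view.html; \r\n line endings are an assumption.
var html = [
  '<body>',
  '@{each people->person}',
  '<div>',
  '<b>@{person.name}:</b> @{person.age}',
  '</div>',
  '@{/each}',
  '</body>'
].join('\r\n');

var match = eachBlock.exec(html);

console.log(match[1]);                 // people->person
console.log(JSON.stringify(match[2])); // the block body, line breaks included
```

The first capture group holds the "collection->item" expression and the second holds everything between the opening and closing tags.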
As an additional tip:
you could also rewrite [\s\S] (any whitespace character or any non-whitespace character) as [^] (not nothing), which would leave you with the following RegEx:
/@\{each ([a-zA-Z0-9>.-]*)\}([^]*)@\{\/each}/g
JavaScript ... treats [^] as a negated empty character class that matches any single character. - source
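A small sketch of how [\s\S] and [^] behave in JavaScript, with the dot included for contrast. The variable names and the sample text are made up for illustration, and since [^] is specific to JavaScript's flavor, this should not be assumed to carry over to other regex engines.

```javascript
// Sample text with Windows and Unix line breaks plus a tab (hypothetical data).
var text = 'line one\r\nline two\n\ttabbed';

var viaWhitespacePair = text.match(/[\s\S]*/)[0]; // whitespace or non-whitespace
var viaEmptyNegation = text.match(/[^]*/)[0];     // negated empty class
var viaDot = text.match(/.*/)[0];                 // the dot stops at line breaks

console.log(viaWhitespacePair === text); // true
console.log(viaEmptyNegation === text);  // true
console.log(viaDot);                     // line one
```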